We consider the problems of estimation and selection of parameters
endowed with a known group structure, when the groups are as-
sumed to be sign-coherent, that is, gathering either nonnegative, non- 
positive or null parameters. To tackle this problem, we propose the 
cooperative-Lasso penalty. We derive the optimality conditions defin- 
ing the cooperative-Lasso estimate for generalized linear models, and 
propose an efficient active set algorithm suited to high-dimensional 
problems. We study the asymptotic consistency of the estimator in 
the linear regression setup and derive its irrepresentable conditions, 
which are milder than the ones of the group-Lasso regarding the 
matching of groups with the sparsity pattern of the true parame- 
ters. We also address the problem of model selection in linear regres- 
sion by deriving an approximation of the degrees of freedom of the 
cooperative-Lasso estimator. Simulations comparing the proposed es- 
timator to the group and sparse group-Lasso comply with our theo- 
retical results, showing consistent improvements in support recovery 
for sign-coherent groups. We finally propose two examples illustrating 
the wide applicability of the cooperative-Lasso: first to the processing 
of ordinal variables, where the penalty acts as a monotonicity prior; 
second to the processing of genomic data, where the set of differen- 
tially expressed probes is enriched by incorporating all the probes of 
the microarray that are related to the corresponding genes. 

1. Introduction. This paper addresses the problems of estimation and 
inference of parameters when a group structure among parameters is known. 
We propose a new penalty for the case where the groups are assumed to 



Received March 2011; revised September 2011.

Supported in part by the PASCAL2 Network of Excellence, the European ICT FP7
Grant 247022-MASH and the French National Research Agency (ANR) Grant ClasSel
ANR-08-EMER-002.

Key words and phrases. Penalization, sparsity, grouped variables, ordinal variables, 
continuous variables, sign-coherence, microarray analysis. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Applied Statistics,
2012, Vol. 6, No. 2, 795-830. This reprint differs from the original in pagination 
and typographic detail. 






J. CHIQUET, Y. GRANDVALET AND C. CHARBONNIER 



gather either nonpositive, nonnegative or null parameters. All such groups 
will be referred to as sign-coherent. 

As the main motivating example, we consider the linear regression model 

(1)   $Y = X\beta^\star + \varepsilon = \sum_{k=1}^{K} \sum_{j \in \mathcal{G}_k} X_j \beta_j^\star + \varepsilon,$

where $Y$ is a continuous response variable, $X = (X_1, \dots, X_p)$ is a vector of $p$
predictor variables, $\beta^\star$ is the vector of unknown parameters and $\varepsilon$ is a
zero-mean Gaussian error variable with variance $\sigma^2$. The set of indexes $\{1, \dots, p\}$
is partitioned into $K$ groups $\{\mathcal{G}_k\}_{k=1}^{K}$ corresponding to predictors and
parameters. We will assume throughout this paper that $\beta^\star$ has few nonzero
coefficients, with sparsity and sign patterns governed by the groups $\mathcal{G}_k$, that
is, groups being likely to gather either positive, negative or null parameters.

The estimation and inference of $\beta^\star$ is based on training data, consisting of
a vector $\mathbf{y} = (y_1, \dots, y_n)^\top$ of responses and an $n \times p$ design matrix $\mathbf{X}$ whose
$j$th column contains $\mathbf{x}_j = (x_j^1, \dots, x_j^n)^\top$, the $n$ observations for variable $X_j$.

For clarity, we assume that both $\mathbf{y}$ and $\{\mathbf{x}_j\}_{j=1,\dots,p}$ are centered so as to
eliminate the intercept from fitting criteria.

Penalization methods that build on the $\ell_1$-norm, referred to as Lasso
procedures (Least Absolute Shrinkage and Selection Operator), are now 
widely used to tackle simultaneously variable estimation and selection in 
sparse problems. Among these, the group-Lasso, independently proposed by 
Grandvalet and Canu (1999) and Bakin (1999) and later developed by Yuan 
and Lin (2006), uses the group structure to define a shrinkage estimator of 
the form 

(2)   $\hat\beta^{\mathrm{group}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda \sum_{k=1}^{K} w_k \|\beta_{\mathcal{G}_k}\| \right\},$

where $\mathcal{G}_k$ is the subset of indices defining the $k$th group of variables and
$\|\cdot\|$ is the Euclidean norm. The tuning parameter $\lambda \geq 0$ controls the overall
amount of penalty and weights $w_k > 0$ adapt the level of penalty within
a given group. Typically, one sets $w_k = \sqrt{p_k}$, where $p_k$ is the cardinality
of $\mathcal{G}_k$, in order to adjust shrinkage according to group sizes. The penalizer
in (2) is known to induce sparsity at the group level, setting a whole group
of parameters to zero for values of $\lambda$ which are large enough. Note that
when we assign one group to each predictor, we recover the original Lasso 
[Tibshirani (1996)]. 

The algorithms for finding the group-Lasso estimator have considerably 
improved recently. Foygel and Drton (2010) develop a block-wise algorithm, 
where each group of coefficients is updated at a time, using a single line 
search that provides the exact optimal value for one group, considering all 
other coefficients fixed. Meier, van de Geer and Bühlmann (2008) depart
from linear regression in problem (2) by studying group-Lasso penalties for 
logistic regression. Their block-coordinate descent method is applicable to 
generalized linear models. Here, we build on the subdifferential calculus ap- 
proach originally proposed by Osborne, Presnell and Turlach (2000) for the 
Lasso, whose active set algorithm has been adapted to the group-Lasso [Roth 
and Fischer (2008)]. 

Compared to the group-Lasso, this paper deals with a stronger assump- 
tion regarding the group structure. Groups should not only reveal the spar- 
sity pattern, but they should also be relevant for sign patterns: all coeffi- 
cients within a group should be sign-coherent, that is, they should either 
be null, nonpositive or nonnegative. This desideratum arises often when 
the groups gather redundant or consonant variables (a usual outcome when 
groups are defined from clusters of correlated variables). To perform this 
sign-coherent grouped variable selection, we propose a novel penalty that 
we call the cooperative-Lasso, in short the coop-Lasso. 

The coop-Lasso is amenable to the selection of patterns that cannot be 
achieved with the group-Lasso. This ability, which can be observed for finite 
samples, also leads to consistency results under the mildest assumptions. 
Indeed, the consistency results for the group-Lasso assume that the set of 
nonzero coefficients of $\beta^\star$ is an exact union of groups [Bach (2008); Nardi and
Rinaldo (2008)], while exact support recovery may be achieved with coop- 
Lasso when some zero coefficients belong to a group having either positive or 
negative coefficients. For example, with groups $\mathcal{G}_1 = \{1, 2\}$ and $\mathcal{G}_2 = \{3, 4, 5\}$,
the support of $\beta^\star = (-1, 1, 0, 1, 1)^\top$ may be recovered with the coop-Lasso,
but not with the group-Lasso, which may then deteriorate the performances 
of the Lasso [Huang and Zhang (2010)]. Friedman, Hastie and Tibshirani 
(2010) propose to overcome this restriction by adding an $\ell_1$ penalty to the
objective function in (2), in the vein of the hierarchical penalties of Zhao, 
Rocha and Yu (2009). The new term provides additional flexibility but de- 
mands an additional tuning parameter, while our approach takes a different 
stance by assuming sign-coherence, with the benefit of requiring a single 
tuning parameter. 

Section 6 describes two applications where sign-coherence is a sensible as- 
sumption. The first one considers ordered categorical data, which are com- 
mon in regression and classification. The coop-Lasso can then be used to 
induce a monotonic response to the ordered levels of a covariate, without 
translating each level of the categorical variable into a prescribed quantita- 
tive value. The second application describes the situation where redundancy 
in measurements causes sign-coherence to be expected. Similar behaviors 
should be observed when features have been grouped by a clustering algorithm
such as average linkage hierarchical clustering, which is nowadays
routinely used for grouping genes in microarray data analysis [Eisen et al. 
(1998); Park, Hastie and Tibshirani (2007); Ma, Song and Huang (2007)]. 
Finally, in numerous problems of multiple inference, the sign-coherence assumption
is also reasonable: when predicting closely related responses (e.g.,
regressing male and female life expectancy against economic and social vari- 
ables) or when analyzing multilevel data (e.g., predicting academic achieve- 
ment against individual factors across schools), the set of coefficients asso- 
ciated to a predictor (resp., for all response variables or all data clusters) 
forms a group that can often be considered as sign-coherent because effects 
can be assumed to be qualitatively similar. Along these lines, we successfully 
applied the coop-Lasso penalizer for the joint inference of several network 
structures [Chiquet, Grandvalet and Ambroise (2011)]. 

The rest of the paper is organized as follows: Section 2 presents the coop- 
Lasso penalty, with the derivation of the optimality conditions which are 
the basis for an active set algorithm. Consistency results and the associated 
irrepresentable conditions are given in Section 3. In Section 4 we derive an 
approximation of the degrees of freedom that can be used in the Bayesian 
Information Criterion (BIC) and the Akaike Information Criterion (AIC) for 
model selection. Section 5 is dedicated to simulations assessing the perfor- 
mances of the coop-Lasso in terms of sparsity pattern recovery, parameters 
estimation and robustness. Section 6 considers real data sets, with ordi- 
nal and continuous covariates. Note that all proofs are postponed until the 
Appendix. 

2. Cooperative-Lasso. 

2.1. Definitions and optimality conditions. Group-norm and coop-norm. 
We define a group structure by setting a partition of the index set $\mathcal{I} = \{1, \dots, p\}$, that is,

$\mathcal{I} = \bigcup_{k=1}^{K} \mathcal{G}_k$ with $\mathcal{G}_k \cap \mathcal{G}_\ell = \emptyset$ for $k \neq \ell$.

Let $v = (v_1, \dots, v_p)^\top \in \mathbb{R}^p$ and $p_k$ denote the cardinality of group $k$. We
define $v_{\mathcal{G}_k} \in \mathbb{R}^{p_k}$ as the vector $(v_j)_{j \in \mathcal{G}_k}$. For the chosen groups $\{\mathcal{G}_k\}_{k=1}^{K}$, the
group-Lasso norm reads

(3)   $\|v\|_{\mathrm{group}} = \sum_{k=1}^{K} w_k \|v_{\mathcal{G}_k}\|,$

where $w_k > 0$ are fixed parameters that adapt the amount of penalty
for each group. Likewise, the sparse group-Lasso norm [Friedman, Hastie and
Tibshirani (2010)] is defined as a convex combination of the group-Lasso and
the $\ell_1$ norms:

(4)   $\|v\|_{\mathrm{sgl}} = \alpha \|v\|_{\mathrm{group}} + (1 - \alpha) \|v\|_1,$
where $\alpha$ is meant to be a tuning parameter, but may be fixed to 1/2 [Friedman,
Hastie and Tibshirani (2010); Zhou et al. (2010)]. We will always set
it to this default value in what follows.

Let $v^+ = (v_1^+, \dots, v_p^+)^\top$ and $v^- = (v_1^-, \dots, v_p^-)^\top$ be the componentwise
positive and negative part of $v$, that is, $v_j^+ = \max(0, v_j)$ and $v_j^- = \max(0, -v_j)$,
respectively. We call coop-norm of $v$ the sum of group-norms on $v^+$ and $v^-$,

$\|v\|_{\mathrm{coop}} = \|v^+\|_{\mathrm{group}} + \|v^-\|_{\mathrm{group}} = \sum_{k=1}^{K} w_k \left( \|v^+_{\mathcal{G}_k}\| + \|v^-_{\mathcal{G}_k}\| \right),$

which is clearly a norm on $\mathbb{R}^p$.
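To make the definition concrete, the following NumPy sketch evaluates the group-norm and the coop-norm of a vector. It is only an illustration: the function names `group_norm` and `coop_norm`, the 0-based index lists and the example vectors are assumptions of this sketch, not part of the original paper.

```python
import numpy as np

def group_norm(v, groups, weights):
    """Group-Lasso norm: sum over groups of w_k * ||v_{G_k}||_2."""
    v = np.asarray(v, dtype=float)
    return sum(w * np.linalg.norm(v[g]) for g, w in zip(groups, weights))

def coop_norm(v, groups, weights):
    """Coop-norm: group norm of the positive part plus group norm of the negative part."""
    v = np.asarray(v, dtype=float)
    v_pos = np.maximum(0.0, v)    # componentwise positive part v^+
    v_neg = np.maximum(0.0, -v)   # componentwise negative part v^-
    return group_norm(v_pos, groups, weights) + group_norm(v_neg, groups, weights)

# Hypothetical example: two groups of size two, weights w_k = sqrt(p_k).
groups = [[0, 1], [2, 3]]
weights = [np.sqrt(len(g)) for g in groups]
coherent = [1.0, 2.0, -3.0, 0.0]     # every group is sign-coherent
incoherent = [1.0, -2.0, -3.0, 0.0]  # first group mixes signs

print(coop_norm(coherent, groups, weights), group_norm(coherent, groups, weights))
print(coop_norm(incoherent, groups, weights), group_norm(incoherent, groups, weights))
```

On sign-coherent groups the two norms coincide, while a sign mismatch inside a group makes the coop-norm strictly larger, which is the extra cost that favors sign-coherent solutions.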

The coop-Lasso estimate of $\beta^\star$ as defined in (1) is

(5)   $\hat\beta^{\mathrm{coop}} = \arg\min_{\beta \in \mathbb{R}^p} L(\beta)$ with $L(\beta) = \frac{1}{2} \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda \|\beta\|_{\mathrm{coop}},$

where $\lambda \geq 0$ is a tuning parameter common to all groups. Appropriate
choices for $\lambda$ will be discussed in Sections 4 and 3 dealing with model selection
and consistency, respectively.

Illustrations of the group, sparse group and coop norms are given in Figure 1
for a vector $\beta = (\beta_1, \beta_2, \beta_3, \beta_4)^\top$ with two groups $\mathcal{G}_1 = \{1, 2\}$ and
$\mathcal{G}_2 = \{3, 4\}$. We represent several views of the unit ball for each of these
norms. For the coop-norm, this ball represents the set of feasible solutions
for an optimization problem equivalent to (5), where the sum of squared
residuals is minimized under unitary constraints on $\|\beta\|_{\mathrm{coop}}$. The same
interpretation holds for the group and sparse group norms, provided the sum
of squared residuals is minimized under unitary constraints on $\|\beta\|_{\mathrm{group}}$ and
$\|\beta\|_{\mathrm{sgl}}$, respectively.

These plots provide some insight into the sparsity pattern that originates 
from the penalties, since sparsity is related to the singularities of the boundary
of the feasible set. First, consider the group-Lasso: the first row illustrates
that when $\beta_4$ is null its group companion $\beta_3$ may also be exactly zero (corners
on the boundary at $\beta_3 = 0$), while the second row shows that this event
is improbable when $\beta_4$ differs from zero (smooth boundary at $\beta_3 = 0$). The
second and third columns display the same type of relationships within $\mathcal{G}_1$
between $\beta_2$ and $\beta_1$, which are expected due to the symmetries of the unit
ball. The last column displays $\ell_2$ balls, which characterize the within-groups
feasibility subsets, showing that once a group is activated, all its members 
will be nonzero. 

Now, consider the sparse group-norm: the combination of the group and
Lasso penalties has uniformly shrunk the feasible set toward the Lasso $\ell_1$ unit
ball, thus creating new edges that provide a chance to zero any parameter in
any situation, with an elastic-net-like penalty [Zou and Hastie (2005)] within
and between groups. The comparison of the last two columns illustrates that
the differentiation between the within-group and between-group penalties is
less marked than for the group-Lasso.

Fig. 1. Feasible sets for the coop-Lasso, group-Lasso and sparse group-Lasso penalties.
First column: cuts through $(\beta_1, \beta_2, \beta_3)$ at $\beta_4 = 0$ and $\beta_4 = 0.3$; $(\beta_1, \beta_2)$ span the horizontal
plane and $\beta_3$ is on the vertical axis; second and third columns: cuts through $(\beta_1, \beta_3)$ at
various values of $(\beta_2, \beta_4)$; last column: cuts through $(\beta_1, \beta_2)$ at various values of $(\beta_3, \beta_4)$.

Finally, consider the coop-norm: compared to the group-norm, there are 
also additional discontinuities resulting in new edges on the 3-D plots. While 
the sparse group-Lasso edges were created by a uniform shrinking toward
the $\ell_1$ unit ball, the coop-Lasso new edges result from slicing the group-Lasso
unit ball, depriving sign-incoherent orthants of some of the group-Lasso
feasible solutions ($\|\beta\|_{\mathrm{coop}} > \|\beta\|_{\mathrm{group}}$ in these regions). Note that, in
general, there are fewer new edges than with the sparse group-Lasso, since
the new opportunities to zero some coefficients are limited to the case where 
the group-Lasso would have allowed a solution with opposite signs within 
a group. The crucial difference with the group and sparse group-Lasso is 
the loss of the axial symmetry when some variables are nonzero: decoupling 
the positive and negative parts of the regression coefficients favors solutions 
where signs match within a group. Slicing of the unit group-norm ball does 
not affect the positive and negative orthants, but large areas corresponding 
to sign mismatches have been peeled off, as best seen on the last column, 
which also illustrates the strong differentiation between within-group and 
between-group penalties. 

Before stating the optimality conditions for problem (5), we introduce
some notation related to the sparsity pattern of parameters, which will be required
to express the necessary and sufficient condition for optimality. First,
we recall that the unknown vector of parameters $\beta^\star$ is typically sparse; its
support is denoted $\mathcal{S} = \{j : \beta_j^\star \neq 0\}$ and $\mathcal{S}^c = \{j : \beta_j^\star = 0\}$ is the complementary
set of true zeros. Once the problem has been supplied with a group
structure, we define $\mathcal{S}_k = \mathcal{S} \cap \mathcal{G}_k$ and $\mathcal{S}_k^c = \mathcal{S}^c \cap \mathcal{G}_k$ as the sets of relevant,
respectively irrelevant, predictors within group $k$, for all $k = 1, \dots, K$. Similar
notation $\mathcal{S}(\beta)$, $\mathcal{S}_k(\beta)$ and $\mathcal{S}_k^c(\beta)$ is defined for an arbitrary vector $\beta \in \mathbb{R}^p$.

Furthermore, for clarity and brevity, we introduce the functions $\{\varphi_j\}_{j=1,\dots,p}$,
which return the componentwise positive or negative part of a vector according
to the sign of its $j$th element, that is, $\forall k \in \{1, \dots, K\}$, $\forall j \in \mathcal{G}_k$, $\forall v \in \mathbb{R}^p$,

(6)   $\varphi_j(v_{\mathcal{G}_k}) = \begin{cases} 0 & \text{if } v_j = 0, \\ v^+_{\mathcal{G}_k} & \text{if } v_j > 0, \\ v^-_{\mathcal{G}_k} & \text{if } v_j < 0. \end{cases}$

Optimality conditions. The objective function $L$ in (5) is continuous and
coercive, thus problem (5) admits at least one minimum. If $\mathbf{X}$ has rank $p$,
then the minimum is unique since $L$ is strictly convex. Furthermore, $L$ is
smooth, except at some locations with zero coefficients, due to the singularities
of the coop-norm. Since $L$ is convex, a necessary and sufficient condition
for the optimality of $\beta$ is that the null vector belongs to the subdifferential
of $L$ whose expression is provided in the following lemma.
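Lemma 1. For all $\beta \in \mathbb{R}^p$, the subdifferential of the objective function of
problem (5) is

(7)   $\partial_\beta L(\beta) = \{ v \in \mathbb{R}^p : v = \mathbf{X}^\top(\mathbf{X}\beta - \mathbf{y}) + \lambda\theta \},$

where $\theta \in \mathbb{R}^p$ is any vector belonging to the subdifferential of the coop-norm,
that is,

(8a)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{S}_k(\beta)$:   $\theta_j = \frac{w_k \beta_j}{\|\varphi_j(\beta_{\mathcal{G}_k})\|},$

(8b)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{S}_k^c(\beta)$:   $\|\varphi_j(\theta_{\mathcal{G}_k})\| \leq w_k.$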

The following optimality conditions, which result directly from Lemma 1, 
are an essential building block of the algorithm we propose to compute the 
coop-Lasso estimate. They also provide an important basis for showing the 
consistency results. 

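Theorem 1. Problem (5) admits at least one solution, which is unique
if $\mathbf{X}$ has rank $p$. All critical points $\beta$ of the objective function $L$ verifying
the following conditions are global minima:

(9a)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{S}_k(\beta)$:   $\mathbf{x}_j^\top(\mathbf{X}\beta - \mathbf{y}) + \frac{\lambda w_k}{\|\varphi_j(\beta_{\mathcal{G}_k})\|} \beta_j = 0,$

(9b)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{S}_k^c(\beta)$:   $\|\varphi_j(\mathbf{X}_{\cdot\mathcal{G}_k}^\top(\mathbf{X}\beta - \mathbf{y}))\| \leq \lambda w_k,$

where $\mathbf{X}_{\cdot\mathcal{G}_k}$ is the submatrix of $\mathbf{X}$ with all rows and columns indexed by $\mathcal{G}_k$.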

Note here an important distinction compared to the group-Lasso, where 
the optimality conditions are expressed solely according to the groups Qk 
[see, e.g., Roth and Fischer (2008)]. Hence, while the sparsity pattern of 
the solution is strongly constrained by the predefined group structure in the 
group-Lasso, deviations from this structure are possible for the coop-Lasso. 
The asymptotic analysis of Section 3 confirms that exact support recovery 
is possible even when the support of $\beta^\star$ cannot be expressed as a simple
union of groups, provided the groups intersecting the true support are sign- 
coherent. 
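As an illustration of how conditions (9a) and (9b) can be verified numerically, the sketch below computes the largest violation of these conditions for a candidate vector. It assumes the reading of $\varphi_j$ given in the text (positive part of the group vector when the $j$th entry is positive, negative part when it is negative, zero otherwise); the helper names `phi` and `kkt_violation` are hypothetical and do not come from the paper or from the scoop package.

```python
import numpy as np

def phi(j_local, v):
    """phi_j applied to a group vector v: its positive part if v_j > 0,
    its negative part if v_j < 0, and the zero vector if v_j == 0."""
    if v[j_local] > 0:
        return np.maximum(0.0, v)
    if v[j_local] < 0:
        return np.maximum(0.0, -v)
    return np.zeros_like(v)

def kkt_violation(X, y, beta, lam, groups, weights):
    """Largest violation of (9a)-(9b); close to 0 at a coop-Lasso solution."""
    grad = X.T @ (X @ beta - y)          # gradient of the quadratic loss
    worst = 0.0
    for g, w in zip(groups, weights):
        b_g, g_g = beta[g], grad[g]
        for jl in range(len(g)):
            if b_g[jl] != 0:             # active coordinate: stationarity (9a)
                viol = abs(g_g[jl] + lam * w * b_g[jl] / np.linalg.norm(phi(jl, b_g)))
            else:                        # inactive coordinate: bound (9b)
                viol = max(0.0, np.linalg.norm(phi(jl, g_g)) - lam * w)
            worst = max(worst, viol)
    return worst

# Hypothetical toy problem with an identity design, so that the OLS estimate is y.
X = np.eye(4)
y = np.array([3.0, 4.0, 0.3, -0.2])
groups, weights = [[0, 1], [2, 3]], [1.0, 1.0]
candidate = np.array([2.7, 3.6, 0.0, 0.0])
print(kkt_violation(X, y, candidate, 0.5, groups, weights))
```

Such a check is convenient as a stopping test or as a debugging aid for any solver of problem (5), since a vanishing violation certifies global optimality by Theorem 1.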

2.2. Algorithm. The efficient approaches developed for the Lasso take 
advantage of the sparsity of the solution by solving a series of small linear 
systems, whose sizes are incrementally increased/decreased [Osborne, Pres- 
nell and Turlach (2000)]. This approach was pursued for the group-Lasso 
[Roth and Fischer (2008)] and we proposed an algorithm in the same vein 
for the coop-Lasso in the framework of multiple network inference [Chiquet, 
Grandvalet and Ambroise (2011)]. We provide here a more detailed descrip- 
tion of the latter in the specific context of linear regression. 

The algorithm starts from a sparse initial guess, say, $\beta = 0$, and iterates
two steps:
1. The first step solves problem (5) with respect to $\beta_{\mathcal{A}}$, the subset of
"active" variables, currently identified as being nonzero. At this stage the
current feasible set is restricted to the orthants where the gradient of the
coop-norm has no discontinuities: the optimization problem is thus smooth.
One or more variables may then be declared inactive if the current optimum
reaches the boundary of the current feasible set.

2. The second step assesses the completeness of the set A, by checking 
the optimality conditions with respect to inactive variables. We add a group 
that violates these conditions. In our implementation, we pick the one that 
most violates the optimality condition, since this strategy has been observed 
to require few changes in the active set. When no such violation exists, the 
current solution is optimal. 

These two steps outline the algorithm, which is detailed in more technical 
terms in Algorithm 1. The principle is readily applied to any generalized 
linear model by simply defining the appropriate objective function L. In 
our current implementation (a pre-release of our R-package scoop is available
at http://stat.genopole.cnrs.fr/logiciels/scoop) the linear and
logistic regression models are implemented using either Broyden-Fletcher- 
Goldfarb-Shanno (BFGS) quasi-Newton updates with box constraints, or 
proximal methods [Beck and Teboulle (2009)] to solve the smooth optimiza- 
tion problem in Step 1. 
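For readers who want to experiment, here is a minimal proximal-gradient sketch for problem (5). It is not the active set algorithm described above nor the scoop implementation: it applies plain proximal-gradient iterations to the full problem, assuming that the proximal operator of the coop-norm is a group soft-thresholding applied separately to the positive and negative parts of each group. The names `prox_coop` and `coop_lasso_ista`, the fixed step size and the iteration count are choices of this sketch.

```python
import numpy as np

def prox_coop(b, t, groups, weights):
    """Proximal operator of t * coop-norm (assumed form): group soft-thresholding
    applied separately to the positive and to the negative part of each group."""
    out = np.zeros_like(b)
    for g, w in zip(groups, weights):
        for part, sign in ((np.maximum(0.0, b[g]), 1.0), (np.maximum(0.0, -b[g]), -1.0)):
            nrm = np.linalg.norm(part)
            if nrm > t * w:              # otherwise this part is set to zero
                out[g] += sign * (1.0 - t * w / nrm) * part
    return out

def coop_lasso_ista(X, y, lam, groups, weights, beta0=None, n_iter=500):
    """Proximal-gradient iterations for 0.5 * ||y - X b||^2 + lam * ||b||_coop."""
    beta = np.zeros(X.shape[1]) if beta0 is None else np.array(beta0, dtype=float)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = prox_coop(beta - step * grad, step * lam, groups, weights)
    return beta

# Hypothetical toy problem with an identity design.
X = np.eye(4)
y = np.array([3.0, 4.0, 2.0, -1.0])
groups, weights = [[0, 1], [2, 3]], [1.0, 1.0]
print(coop_lasso_ista(X, y, 0.5, groups, weights))
```

The optional `beta0` argument lets a caller pass a previous solution as a starting point, which is how a warm start along a sequence of penalties could be wired in.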

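Finally, note that to compute a series of solutions along the regularization
path for problem (5), we simply choose a series of penalties $\lambda^1 = \lambda_{\max} > \cdots >
\lambda^\ell > \cdots > \lambda^L = \lambda_{\min} \geq 0$ such that $\hat\beta^{\mathrm{coop}}(\lambda_{\max}) = 0$, that is,

$\lambda_{\max} = \max_{k \in \{1, \dots, K\}} \max_{j \in \mathcal{G}_k} \frac{1}{w_k} \|\varphi_j(\mathbf{X}_{\cdot\mathcal{G}_k}^\top \mathbf{y})\|.$

We then use the usual warm start strategy, where the feasible initial guess for
$\hat\beta^{\mathrm{coop}}(\lambda^\ell)$, the coop-Lasso estimate with penalty parameter $\lambda^\ell$, is initialized
with $\hat\beta^{\mathrm{coop}}(\lambda^{\ell-1})$.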
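The sketch below shows one way to build such a sequence of penalties. It assumes that $\lambda_{\max}$ is the smallest penalty for which the null vector satisfies condition (9b), which amounts to comparing, for each group, the norms of the positive and negative parts of $\mathbf{X}_{\cdot\mathcal{G}_k}^\top \mathbf{y}$ with $\lambda w_k$. The log-spaced grid, the ratio between the smallest and largest penalties, and the function names are assumptions of this illustration.

```python
import numpy as np

def lambda_max(X, y, groups, weights):
    """Smallest penalty for which beta = 0 satisfies condition (9b)."""
    corr = X.T @ y
    best = 0.0
    for g, w in zip(groups, weights):
        pos = np.linalg.norm(np.maximum(0.0, corr[g]))   # ||(X_G^T y)^+||
        neg = np.linalg.norm(np.maximum(0.0, -corr[g]))  # ||(X_G^T y)^-||
        best = max(best, pos / w, neg / w)
    return best

def penalty_grid(lam_max, n_lambda=20, ratio=1e-2):
    """Decreasing, log-spaced penalties from lam_max down to ratio * lam_max."""
    return np.geomspace(lam_max, ratio * lam_max, n_lambda)

# Hypothetical toy problem with an identity design.
X = np.eye(4)
y = np.array([3.0, 4.0, 2.0, -1.0])
groups, weights = [[0, 1], [2, 3]], [1.0, 1.0]
grid = penalty_grid(lambda_max(X, y, groups, weights), n_lambda=5)
print(grid)
```

A path is then obtained by solving problem (5) for each value of the grid in decreasing order, starting each fit from the previous solution.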



2.3. Orthonormal design case. The orthonormal design case, where 
$\mathbf{X}^\top \mathbf{X} = \mathbf{I}_p$, has been providing useful insights for penalization techniques
regarding the effects of shrinkage. Indeed, in this particular case, most usual 
shrinkage estimators can be expressed in closed-form as functions of the or- 
dinary least squares (OLS) estimate. These expressions pave the way for the 
derivation of approximations of the degrees of freedom [Tibshirani (1996); 
Yuan and Lin (2006) and Section 4], which may be convenient for model 
selection in the absence of exact formulae. 

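In the orthonormal setting, for any $\beta_j$, we have $\mathbf{x}_j^\top(\mathbf{X}\beta - \mathbf{y}) = \beta_j - \hat\beta_j^{\mathrm{ols}}$.
The optimality conditions (9a) and (9b) can then be written as

(10)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{G}_k$:   $\hat\beta_j^{\mathrm{coop}} = \left(1 - \frac{\lambda w_k}{\|\varphi_j(\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k})\|}\right)^+ \hat\beta_j^{\mathrm{ols}}.$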
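Algorithm 1: Coop-Lasso fitting algorithm

Init. Start from a feasible $\beta \leftarrow \beta^0$,
    $\mathcal{A}_+ \leftarrow \{j \in \mathcal{G}_k : \|\beta^+_{\mathcal{G}_k}\| > 0, k = 1, \dots, K\}$,
    $\mathcal{A}_- \leftarrow \{j \in \mathcal{G}_k : \|\beta^-_{\mathcal{G}_k}\| > 0, k = 1, \dots, K\}$.

Step 1. On $\mathcal{A} \leftarrow \mathcal{A}_+ \cup \mathcal{A}_-$, find a solution to the smooth problem
    $\beta_{\mathcal{A}} \leftarrow \arg\min_{v \in \mathbb{R}^{|\mathcal{A}|}} \frac{1}{2} \|\mathbf{y} - \mathbf{X}_{\cdot\mathcal{A}} v\|^2 + \lambda \|v\|_{\mathrm{coop}}$
    s.t. $v_j \geq 0$ if $j \in \mathcal{A}_+ \cap \mathcal{A}_-^c$,
         $v_j \leq 0$ if $j \in \mathcal{A}_- \cap \mathcal{A}_+^c$,
    where $\mathcal{A}_-^c$ and $\mathcal{A}_+^c$ are the complementary sets of $\mathcal{A}_-$ and $\mathcal{A}_+$, respectively.
    Identify groups inactivated during optimization:
    $\mathcal{A}_+ \leftarrow \mathcal{A}_+ \setminus \{j \in \mathcal{G}_k \subset \mathcal{A}_+ : \|\beta^+_{\mathcal{G}_k}\| = 0$ and $\min_{v \in \partial_{\beta_{\mathcal{G}_k}} L(\beta)} \|v^-\| = 0, k = 1, \dots, K\}$,
    $\mathcal{A}_- \leftarrow \mathcal{A}_- \setminus \{j \in \mathcal{G}_k \subset \mathcal{A}_- : \|\beta^-_{\mathcal{G}_k}\| = 0$ and $\min_{v \in \partial_{\beta_{\mathcal{G}_k}} L(\beta)} \|v^+\| = 0, k = 1, \dots, K\}$.

Step 2. Identify the greatest violation of optimality conditions:
    $g_k^+ = \min_{v \in \partial_{\beta_{\mathcal{G}_k}} L(\beta)} \|v^+\|$,   $q \leftarrow \arg\max_k g_k^+$,
    $g_k^- = \min_{v \in \partial_{\beta_{\mathcal{G}_k}} L(\beta)} \|v^-\|$,   $r \leftarrow \arg\max_k g_k^-$.
    if $\max(g_q^+, g_r^-) = 0$ then
        Stop and return $\beta$, which is optimal
    else
        if $g_q^+ > g_r^-$ then $\mathcal{A}_- \leftarrow \mathcal{A}_- \cup \mathcal{G}_q$ else $\mathcal{A}_+ \leftarrow \mathcal{A}_+ \cup \mathcal{G}_r$

Repeat Steps 1 and 2 until convergence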



For reference, we recall the solution to the group-Lasso [Yuan and Lin (2006)] 
in the same condition 

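(11)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{G}_k$:   $\hat\beta_j^{\mathrm{group}} = \left(1 - \frac{\lambda w_k}{\|\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k}\|}\right)^+ \hat\beta_j^{\mathrm{ols}},$

while the Lasso solution [Tibshirani (1996)] is

(12)   $\forall j \in \{1, \dots, p\}$:   $\hat\beta_j^{\mathrm{lasso}} = \left(1 - \frac{\lambda}{|\hat\beta_j^{\mathrm{ols}}|}\right)^+ \hat\beta_j^{\mathrm{ols}}.$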

Equations (10)-(12) reveal strong commonalities. First, the coefficients of 
these shrinkage estimators are of the sign of the OLS estimates. Second, the 
norm used in the penalty defines a region where small OLS coefficients are 
shrunk to zero, while large ones are shrunk inversely proportional to this 
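norm. Finally, by grouping the terms corresponding to one group in equations
(10)-(11), a uniform translation effect, analogous to the one observed
for the Lasso, comes into view:

(13)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{G}_k$:   $\|\varphi_j(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})\| = (\|\varphi_j(\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k})\| - \lambda w_k)^+,$
       $\forall k \in \{1, \dots, K\}$:   $\|\hat\beta^{\mathrm{group}}_{\mathcal{G}_k}\| = (\|\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k}\| - \lambda w_k)^+,$
       $\forall j \in \{1, \dots, p\}$:   $|\hat\beta_j^{\mathrm{lasso}}| = (|\hat\beta_j^{\mathrm{ols}}| - \lambda)^+.$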

The group-Lasso (11) differs primarily from the Lasso (12) owing to the 
common penalty $\lambda w_k / \|\hat\beta^{\mathrm{ols}}_{\mathcal{G}_k}\|$ for all the coefficients belonging to group $k$. The
magnitude of shrinkage is determined by all within-group OLS coefficients, 
and is thus radically different from a ridge regression penalty in this regard. 
For the coop-Lasso estimator (10), two penalties possibly apply to group k, 
for the positive and the negative OLS coefficients, respectively. If all within- 
group OLS coefficients are of the same sign, coop-Lasso is identical to group- 
Lasso; if some signs disagree, the magnitude of the penalty only depends on 
the within-group OLS coefficients with an identical sign. In the extreme case 
where exactly one OLS coefficient is positive/negative, the coop-penalty is 
identical to a Lasso penalty on this coefficient. 
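The three closed-form shrinkage rules can be compared directly with a few lines of NumPy. The sketch below assumes an orthonormal design, so that each estimator is a function of the OLS coefficients only, and it reads the coop-Lasso rule as a group shrinkage applied separately to the positive and negative OLS coefficients of each group, as described in the paragraph above. Function names and the numerical example are hypothetical.

```python
import numpy as np

def shrink_factor(norm, threshold):
    """Scaling factor (1 - threshold / norm)^+, set to 0 when norm == 0."""
    return max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0

def lasso_orth(b_ols, lam):
    b = np.asarray(b_ols, dtype=float)
    return np.array([shrink_factor(abs(x), lam) * x for x in b])

def group_orth(b_ols, lam, groups, weights):
    b = np.asarray(b_ols, dtype=float)
    out = np.zeros_like(b)
    for g, w in zip(groups, weights):
        out[g] = shrink_factor(np.linalg.norm(b[g]), lam * w) * b[g]
    return out

def coop_orth(b_ols, lam, groups, weights):
    b = np.asarray(b_ols, dtype=float)
    out = np.zeros_like(b)
    for g, w in zip(groups, weights):
        pos = np.maximum(0.0, b[g])    # positive OLS coefficients of the group
        neg = np.maximum(0.0, -b[g])   # magnitudes of the negative ones
        out[g] = (shrink_factor(np.linalg.norm(pos), lam * w) * pos
                  - shrink_factor(np.linalg.norm(neg), lam * w) * neg)
    return out

# Hypothetical OLS coefficients: first group sign-coherent, second one not.
b_ols = [3.0, 4.0, 2.0, -1.0]
groups, weights = [[0, 1], [2, 3]], [1.0, 1.0]
print(lasso_orth(b_ols, 0.5))
print(group_orth(b_ols, 0.5, groups, weights))
print(coop_orth(b_ols, 0.5, groups, weights))
```

With unit weights, the coop-Lasso output matches the group-Lasso on the first (sign-coherent) group and the Lasso on the second group, where exactly one coefficient is positive and one is negative.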

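Note that such a simple analytical formulation is not available for the
sparse group-Lasso estimate $\hat\beta^{\mathrm{sgl}}$, but an expression can be obtained by
chaining two simple shrinkage operations. Introducing an intermediate solution
$\tilde\beta^{\mathrm{lasso}}$, we have, $\forall k \in \{1, \dots, K\}$ and $\forall j \in \mathcal{G}_k$,

(14)   $\tilde\beta_j^{\mathrm{lasso}} = \left(1 - \frac{\lambda(1 - \alpha)}{|\hat\beta_j^{\mathrm{ols}}|}\right)^+ \hat\beta_j^{\mathrm{ols}}$   and   $\hat\beta_j^{\mathrm{sgl}} = \left(1 - \frac{\lambda \alpha w_k}{\|\tilde\beta^{\mathrm{lasso}}_{\mathcal{G}_k}\|}\right)^+ \tilde\beta_j^{\mathrm{lasso}}.$

With the weighting of definition (4), the intermediate solution is the Lasso
estimator with penalty parameter $\lambda(1 - \alpha)$, which acts as the OLS estimate
for a group-Lasso of parameter $\lambda\alpha$; both parameters equal $\lambda/2$ for the
default value $\alpha = 1/2$.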
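The chaining of the two shrinkage operations can be sketched as follows for an orthonormal design. To stay neutral about how the overall penalty is split between the two terms, the sketch takes the $\ell_1$ level `lam_l1` and the group level `lam_group` as separate arguments (both equal $\lambda/2$ for the default $\alpha = 1/2$); these argument names and the helper functions are assumptions of this illustration.

```python
import numpy as np

def shrink_factor(norm, threshold):
    """Scaling factor (1 - threshold / norm)^+, set to 0 when norm == 0."""
    return max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0

def lasso_orth(b_ols, lam):
    b = np.asarray(b_ols, dtype=float)
    return np.array([shrink_factor(abs(x), lam) * x for x in b])

def group_orth(b, lam, groups, weights):
    b = np.asarray(b, dtype=float)
    out = np.zeros_like(b)
    for g, w in zip(groups, weights):
        out[g] = shrink_factor(np.linalg.norm(b[g]), lam * w) * b[g]
    return out

def sparse_group_orth(b_ols, lam_l1, lam_group, groups, weights):
    """Lasso shrinkage first, then group shrinkage of the intermediate solution."""
    intermediate = lasso_orth(b_ols, lam_l1)
    return group_orth(intermediate, lam_group, groups, weights)

# Hypothetical OLS coefficients; the second group is small and gets zeroed.
b_ols = [3.0, 4.0, 0.2, -0.1]
groups, weights = [[0, 1], [2, 3]], [1.0, 1.0]
print(sparse_group_orth(b_ols, 0.5, 0.5, groups, weights))
```

In this example the Lasso step already removes the second group, and the group step then rescales the surviving coefficients of the first group by a common factor.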

Figure 2 provides a visual representation of equations (10)-(12) and (14)
for a group with two components, say, $\mathcal{G}_k = \{1, 2\}$. We plot $\hat\beta_1^{\mathrm{lasso}}$, $\hat\beta_1^{\mathrm{group}}$,
$\hat\beta_1^{\mathrm{sgl}}$ and $\hat\beta_1^{\mathrm{coop}}$ as functions of $(\hat\beta_1^{\mathrm{ols}}, \hat\beta_2^{\mathrm{ols}})$. Top-left, the Lasso translates the $\hat\beta_1^{\mathrm{ols}}$
coefficient toward zero, eventually truncating it at zero, regardless of $\hat\beta_2^{\mathrm{ols}}$:
there is no interaction between coefficients. The group-Lasso, top-right, has
a nonlinear shrinking behavior (quite different from the Lasso or ridge penalties
in this respect) and sets $\hat\beta_1^{\mathrm{group}}$ to zero within a Euclidean ball centered at
zero. The sparse group-Lasso, bottom-left, is a hybrid of Lasso and group-Lasso,
whose shrinking behavior lies between its two ancestors. Bottom-right,
the coop-Lasso appears as another form of cross-breed, identical to
the group-Lasso in the positive and negative quadrants, and identical to the
Lasso when the signs of the OLS coefficients mismatch. For groups with
more than two components, intermediate solutions would be possible. This
behavior is shown to allow for some flexibility with respect to the predefined
group structure in the following consistency analysis.

Fig. 2. Lasso, group, sparse group and coop Lasso coefficient estimates, for a group with
2 elements $\mathcal{G}_k = \{1, 2\}$, as a function of the OLS coefficients. The colors emphasize the
positive and negative quadrants of the $(\hat\beta_1^{\mathrm{ols}}, \hat\beta_2^{\mathrm{ols}})$ plane, with red and blue, respectively.

3. Consistency. Beyond its sanity-check value, a consistency analysis 
brings along an appreciation of the strengths and limitations of an esti- 
mation scheme. Here we concentrate on the estimation of the support of the 
parameter vector, that is, the position of its zero entries. Our proof tech- 
nique is drawn from the previous works on the Lasso [Yuan and Lin (2007)] 
and the group-Lasso [Bach (2008)]. 
In this type of analysis, some assumptions on the joint distribution of
$(X, Y)$ are required to guarantee the convergence of empirical covariances.
For the sake of simplicity and coherence, we keep assuming that data are
centered so that we have zero mean random variables and $\Psi = \mathbb{E}[XX^\top]$ is
the covariance matrix of $X$:

(A1) $X$ and $Y$ have finite 4th order moments: $\mathbb{E}[\|X\|^4] < \infty$, $\mathbb{E}[Y^4] < \infty$.
(A2) The covariance matrix $\Psi = \mathbb{E}[XX^\top] \in \mathbb{R}^{p \times p}$ is invertible.

In addition to these standard technical assumptions, we need a more spe- 
cific one, substantially avoiding situations where the coop-Lasso will almost 
never recover the true support: 

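(A3) All sign-incoherent groups are included in the true support: $\forall k \in
\{1, \dots, K\}$, if $\|(\beta^\star_{\mathcal{G}_k})^+\| > 0$ and $\|(\beta^\star_{\mathcal{G}_k})^-\| > 0$, then $\forall j \in \mathcal{G}_k$, $\beta_j^\star \neq 0$.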

Note that this latter assumption is less stringent than the one required for 
the group-Lasso since it does not require that each group of variables should 
either be included in or excluded from the support. For the coop-Lasso, 
sign-coherent groups may intersect the support. 

The spurious relationships that may arise from confounding variables are 
controlled by the so-called strong irrepresentable condition, which guaran- 
tees support recovery for the Lasso [Yuan and Lin (2007)] and the group- 
Lasso [Bach (2008)]. We now introduce suitable variants of these conditions 
for the coop-Lasso. They result in two assumptions: a general one, on the 
magnitude of correlations between relevant and irrelevant variables, and 
a more specific one for groups which intersect the support, on the sign of 
correlations. These conditions will be expressed in a compact vectorial form 
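using the diagonal weighting matrix $\mathbf{D}(\beta)$ such that,

(15)   $\forall k \in \{1, \dots, K\}, \forall j \in \mathcal{S}_k(\beta)$:   $(\mathbf{D}(\beta))_{jj} = w_k \|\varphi_j(\beta_{\mathcal{G}_k})\|^{-1}.$

(A4) For every group including at least one null coefficient (i.e., such
that $\beta_j^\star = 0$ for some $j \in \mathcal{G}_k$ or, equivalently, $\mathcal{S}_k^c \neq \emptyset$), there exists $\eta > 0$ such
that

(16)   $\frac{1}{w_k} \max\left( \|(\Psi_{\mathcal{S}_k^c \mathcal{S}} \Psi_{\mathcal{S}\mathcal{S}}^{-1} \mathbf{D}(\beta^\star_{\mathcal{S}}) \beta^\star_{\mathcal{S}})^+\|, \|(\Psi_{\mathcal{S}_k^c \mathcal{S}} \Psi_{\mathcal{S}\mathcal{S}}^{-1} \mathbf{D}(\beta^\star_{\mathcal{S}}) \beta^\star_{\mathcal{S}})^-\| \right) \leq 1 - \eta,$

where $\Psi_{\mathcal{S}\mathcal{T}}$ is the submatrix of $\Psi$ with lines and columns respectively
indexed by $\mathcal{S}$ and $\mathcal{T}$.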

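(A5) For every group intersecting the support and including either
positive or negative coefficients, letting $\nu_k$ be the sign of these coefficients
[$\nu_k = 1$ if $\|(\beta^\star_{\mathcal{G}_k})^+\| > 0$ and $\nu_k = -1$ if $\|(\beta^\star_{\mathcal{G}_k})^-\| > 0$], the following
inequalities should hold:

(17)   $\nu_k \Psi_{\mathcal{S}_k^c \mathcal{S}} \Psi_{\mathcal{S}\mathcal{S}}^{-1} \mathbf{D}(\beta^\star_{\mathcal{S}}) \beta^\star_{\mathcal{S}} \preceq 0,$

where $\preceq$ denotes componentwise inequality.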
Note that the irrepresentable condition for the group-Lasso only considers
correlations between groups included and excluded from the support. It is
otherwise similar to (16), except that the elements of the weighting matrix $\mathbf{D}$
are $w_k \|\beta^\star_{\mathcal{G}_k}\|^{-1}$ and that the $\ell_2$ norm replaces $\max(\|(\cdot)^+\|, \|(\cdot)^-\|)$.

We now have all the components for stating the coop-Lasso consistency 
theorem, which will consider the following normalized (equivalent) form of 
the optimization problem (5) to allow a direct comparison with the known 
similar results previously stated for the Lasso and group-Lasso [Yuan and 
Lin (2007); Bach (2008)]: 

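(18)   $\hat\beta_n^{\mathrm{coop}} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\beta\|^2 + \lambda_n \|\beta\|_{\mathrm{coop}},$
where $\lambda_n = \lambda / n$.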

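Theorem 2. If assumptions (A1)-(A5) are satisfied, the coop-Lasso estimator
is asymptotically unbiased and has the property of exact support
recovery:

(19)   $\hat\beta_n^{\mathrm{coop}} \xrightarrow{P} \beta^\star$ and $\mathbb{P}(\mathcal{S}(\hat\beta_n^{\mathrm{coop}}) = \mathcal{S}) \to 1,$
for every sequence $\lambda_n$ such that $\lambda_n = \lambda_0 n^{-\gamma}$, $\gamma \in (0, 1/2)$.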

Compared to the group-Lasso, the consistency of support recovery for the 
coop-Lasso differs primarily regarding possible intersection (besides inclu- 
sion and exclusion) between groups and support. This additional flexibility 
applies to every sign-coherent group. Even if the support is the union of 
groups, when all groups are sign-coherent, the coop-Lasso has still an edge 
on group-Lasso since the irrepresentable condition (16) is weaker. Indeed, 
the norm in (16) is dominated by the $\ell_2$ norm used for the group-Lasso. The
next paragraph illustrates that this difference can have remarkable outcomes. 
Finally, when the support is the union of groups comprising sign-incoherent 
ones, there is no systematic advantage in favor of one or the other method. 
While the norm used by the coop-Lasso is dominated by the norm used by 
the group-Lasso, the weighting matrix D has smaller entries for the latter. 

Illustration. We generate data from the regression model (1), with $\beta^\star =
(1, 1, -1, -1, 0, 0, 0, 0)$, equipped with the group structure $\{\mathcal{G}_k\}_{k=1}^{4} = \{\{1, 2\},
\{3, 4\}, \{5, 6\}, \{7, 8\}\}$. The vector $X$ is generated as a centered Gaussian
random vector whose covariance matrix $\Psi$ is chosen so that the irrepresentable
conditions hold for the coop-Lasso, but not for the group-Lasso,
which, we recall, are more demanding for the current situation, with sign-coherent
groups. The random error $\varepsilon$ follows a centered Gaussian distribution
with standard deviation $\sigma = 0.1$, inducing a very high signal to noise
ratio ($R^2 = 0.99$ on average), so that asymptotics provide a realistic view of
the finite sample situation. 
Fig. 3. 50% coverage intervals for the group (left), sparse group (center) and coop (right)
Lasso estimated coefficients along regularization paths: coefficients from the support of $\beta^\star$
are marked by colored horizontal stripes and the other ones by gray vertical stripes.

We generated 1000 samples of size n = 20 from the described model, 
and computed the corresponding 1000 regularization paths for the group- 
Lasso, sparse group-Lasso and coop-Lasso. Figure 3 reports the 50% coverage 
intervals (lower and upper quartiles) along the regularization paths. In this 
setup, the sparse group-Lasso behaves as the group-Lasso, leading to nearly 
identical graphs. Estimation is difficult in this small sample problem (n = 
20, p = 8), and the two versions of the group-Lasso, which first select the 
wrong covariates, never reach the situation where they would have a decisive
advantage over OLS, while the coop-Lasso immediately selects the right
covariates, whose coefficients steadily dominate the irrelevant ones. Model
selection is also difficult, and the BIC criteria provided in Section 4 often
select the OLS model (in about 10% and 50% of cases for the coop-Lasso
and the group-Lasso, respectively). The average root mean square error on 
parameters is of order 10~^ for all methods, with a slight edge for the coop- 
Lasso. The sign error is much more contrasted: 31% for the coop-Lasso vs. 
46% for the group-Lasso, not far better than the 50% of OLS. 

4. Model selection. Model selection amounts here to choosing the
penalization parameter $\lambda$, which restricts the size of the estimate $\hat{\beta}(\lambda)$. Trial
values $\{\lambda_{\min}, \ldots, \lambda_{\max}\}$ define the set of models we have to choose from
along the regularization path. The process aims at picking the model with
minimum prediction error, or the one closest to the model from which data
have been generated, assuming the model is correct, that is, equation (1)
holds. Here "closest" is typically measured by a distance between $\hat{\beta}$ and $\beta^*$,
either based on the value of the coefficients or on their support (true model
selection), and sometimes also on the sign correctness of each nonzero entry.

Among the prerequisites for the selection process to be valid, the previous
consistency analysis comes up with suitable orders of magnitude for the
penalty parameter $\lambda$. However, it does not provide a proper value to be
plugged in (5) and the practice is to use data driven approaches for selecting
an appropriate penalty parameter.

Cross-validation is a recommended option [Hesterberg et al. (2008)] when 
looking for the model minimizing the prediction error, but it is slow and not 
well suited to select the model closest to the true one. Analytical criteria 
provide a faster way to perform model selection and, though the information 
criteria AIC and BIC rely on asymptotic derivations, they often offer good 
practical performances. The BIC and AIC criteria for the Lasso [Zou, Hastie 
and Tibshirani (2007)] and group-Lasso [Yuan and Lin (2006)] have been 
defined through the effective degrees of freedom: 

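(20) $\mathrm{AIC}(\lambda) = \frac{\|y - \hat{y}(\lambda)\|^2}{\sigma^2} + 2\,\mathrm{df}(\lambda)$,

(21) $\mathrm{BIC}(\lambda) = \frac{\|y - \hat{y}(\lambda)\|^2}{\sigma^2} + \log(n)\,\mathrm{df}(\lambda)$,

where $\hat{y}(\lambda) = X\hat{\beta}(\lambda)$ is the vector of predicted values for (5) with penalty
parameter $\lambda$, $\sigma^2$ is the variance of the zero-mean Gaussian error variable $\varepsilon$
in (1) and $\mathrm{df}(\lambda)$ is the number of degrees of freedom of the selected model.
Assuming that equation (1) holds and a differentiability condition on the
mapping $\hat{y}(\lambda)$, Efron (2004), using Stein's theory of unbiased risk estimate
[Stein (1981)], shows that

(22) $\mathrm{df}(\lambda) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{cov}(\hat{y}_i(\lambda), y_i) = \mathbb{E}\left[\mathrm{tr}\left(\frac{\partial \hat{y}(\lambda)}{\partial y}\right)\right]$,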

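where the expectation is taken with respect to y or, equivalently, to the
noise $\varepsilon$. Yuan and Lin (2006) proposed an approximation of the trace term
in the right-hand side of (22), which is used to estimate $\mathrm{df}(\lambda)$ for the
group-Lasso:

(23) $\widetilde{\mathrm{df}}_{\mathrm{group}}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\left(\|\hat{\beta}^{\mathrm{group}}_{\mathcal{G}_k}(\lambda)\| > 0\right) \left(1 + \frac{\|\hat{\beta}^{\mathrm{group}}_{\mathcal{G}_k}(\lambda)\|}{\|\hat{\beta}^{\mathrm{ols}}_{\mathcal{G}_k}\|} (p_k - 1)\right)$,

where $\mathbf{1}(\cdot)$ is the indicator function and $p_k$ is the number of elements in $\mathcal{G}_k$.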
For orthonormal design matrices, (23) is an unbiased estimate of the true 
degrees of freedom of the group-Lasso and Yuan and Lin (2006) suggest 
that this approximation is relevant in more general settings, by reporting 
that "the performance of this approximate Cp-criterion [directly derived 
from (23)] is generally comparable with that of fivefold cross-validation and 
is sometimes better." 
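
As an illustration, the approximation (23) can be sketched in a few lines of NumPy. This is only a minimal sketch, not the authors' implementation: the function name `df_group`, the representation of groups as index arrays and the toy vectors are assumptions made for the example.

```python
import numpy as np

def df_group(beta_hat, beta_ols, groups):
    """Approximate degrees of freedom of a group-Lasso fit, in the spirit of (23).

    beta_hat : group-Lasso estimate at a given penalty level
    beta_ols : OLS estimate (assumed to exist and to be nonzero on active groups)
    groups   : list of index arrays, one per group (hypothetical representation)
    """
    df = 0.0
    for idx in groups:
        norm_hat = np.linalg.norm(beta_hat[idx])
        if norm_hat > 0:  # only active groups contribute
            ratio = norm_hat / np.linalg.norm(beta_ols[idx])
            df += 1.0 + ratio * (len(idx) - 1)
    return df

# Toy example: two groups of size 2; the first is shrunk by half, the second is zeroed.
groups = [np.array([0, 1]), np.array([2, 3])]
beta_ols = np.array([2.0, 0.0, 1.0, 1.0])
beta_hat = np.array([1.0, 0.0, 0.0, 0.0])
df_shrunk = df_group(beta_hat, beta_ols, groups)   # 1 + 0.5 * (2 - 1)
df_full = df_group(beta_ols, beta_ols, groups)     # unpenalized fit: equals p = 4
```

With no shrinkage the estimate reduces to the number of parameters, and it decreases continuously as active groups are shrunk toward zero.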

This approximation of $\mathrm{df}(\lambda)$ relies on the OLS estimate and is hence
limited to setups where the latter exists and is unique. In particular, the
sample size should be larger than the number of predictors (n > p). To
overcome this restriction, we suggest a more general approximation to the
degrees of freedom, based on the ridge estimator

(24) $\hat{\beta}^{\mathrm{ridge}}(\gamma) = (X^T X + \gamma I)^{-1} X^T y$,

which can be computed even for small sample sizes (n < p).
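
A minimal sketch of the ridge reference estimate (24), assuming NumPy; the function name `ridge_estimate` and the random toy data are illustrative only. Solving the linear system is used here instead of forming the inverse explicitly, which is a numerical choice and not something stated in the text.

```python
import numpy as np

def ridge_estimate(X, y, gamma):
    """Ridge estimator (24): solve (X'X + gamma * I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))      # n = 5 < p = 8: the OLS estimate is not unique
y = rng.standard_normal(5)
beta_ridge = ridge_estimate(X, y, gamma=1e-2)
```

For an orthonormal design the ridge estimate is simply the OLS estimate divided by 1 + γ, which is the relation exploited in the orthonormal analysis.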

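Proposition 1. Consider the coop-Lasso estimator $\hat{\beta}^{\mathrm{coop}}(\lambda)$ defined
by (5). Assuming that data are generated according to model (1), and that X
is orthonormal, the following expression of $\widetilde{\mathrm{df}}_{\mathrm{coop}}(\lambda)$ is an unbiased estimate
of $\mathrm{df}(\lambda)$ defined in (22) for the coop-Lasso fit:

(25) $\widetilde{\mathrm{df}}_{\mathrm{coop}}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\left(\|(\hat{\beta}^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))^+\| > 0\right) \left(1 + \frac{p_+^k - 1}{1 + \gamma} \frac{\|(\hat{\beta}^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))^+\|}{\|(\hat{\beta}^{\mathrm{ridge}}_{\mathcal{G}_k}(\gamma))^+\|}\right)$
$\qquad\qquad + \mathbf{1}\left(\|(\hat{\beta}^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))^-\| > 0\right) \left(1 + \frac{p_-^k - 1}{1 + \gamma} \frac{\|(\hat{\beta}^{\mathrm{coop}}_{\mathcal{G}_k}(\lambda))^-\|}{\|(\hat{\beta}^{\mathrm{ridge}}_{\mathcal{G}_k}(\gamma))^-\|}\right)$,

where $p_+^k$ and $p_-^k$ are respectively the number of positive and negative entries
in $\hat{\beta}^{\mathrm{ridge}}_{\mathcal{G}_k}(\gamma)$.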
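
The estimate of Proposition 1 can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes that the estimate sums, over groups and over the positive and negative parts of each group, one plus the ratio of the coop-Lasso norm to the ridge norm, weighted by the number of same-sign ridge entries minus one and divided by 1 + γ. The function name `df_coop` and the toy vectors are hypothetical, and the sketch assumes that an active part of the coop-Lasso estimate always has a nonzero counterpart in the ridge reference.

```python
import numpy as np

def df_coop(beta_hat, beta_ridge, groups, gamma):
    """Sketch of the coop-Lasso degrees-of-freedom estimate of Proposition 1."""
    df = 0.0
    for idx in groups:
        for sign in (+1.0, -1.0):
            # Positive part (sign = +1) or negative part (sign = -1) of the group.
            part_hat = np.maximum(sign * beta_hat[idx], 0.0)
            part_ref = np.maximum(sign * beta_ridge[idx], 0.0)
            norm_hat = np.linalg.norm(part_hat)
            if norm_hat > 0:
                n_same_sign = np.count_nonzero(part_ref)   # p_+^k or p_-^k
                ratio = norm_hat / np.linalg.norm(part_ref)
                df += 1.0 + (n_same_sign - 1) / (1.0 + gamma) * ratio
    return df

groups = [np.array([0, 1, 2])]
beta_ref = np.array([2.0, 2.0, -1.0])     # reference with gamma = 0, i.e. OLS
beta_hat = np.array([1.0, 1.0, 0.0])      # positive part halved, negative part zeroed
df_shrunk = df_coop(beta_hat, beta_ref, groups, gamma=0.0)   # 1 + (2 - 1) * 0.5
df_full = df_coop(beta_ref, beta_ref, groups, gamma=0.0)     # (1 + 1) + (1 + 0) = 3
```

In the unpenalized case the sketch returns the number of parameters of the group, as one would expect from a degrees-of-freedom estimate.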



Proposition 1 raises a practical issue regarding the choice of a good
reference $\hat{\beta}^{\mathrm{ridge}}(\gamma)$. In our numerous simulations (most of which are not reported
here), we did not observe a high sensitivity to $\gamma$, though high values degrade
performances. When X is full rank we use $\gamma = 0$ (the OLS estimate) and,
correspondingly, a vanishing $\gamma$ (the Moore-Penrose solution) when X is of
smaller rank. More refined strategies are left for future works.

Section 5 illustrates that, even in nonorthonormal settings, plugging
expression (25) for the degrees of freedom $\mathrm{df}(\lambda)$ of the coop-Lasso in BIC (21)
or AIC (20) provides sensible model selection criteria. As expected, BIC,
which is more stringent than AIC, is better at retrieving the sparsity
pattern of $\beta^*$, while AIC is slightly better regarding prediction error.
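
To make the selection step concrete, here is a small sketch of how the criteria (20) and (21) could be evaluated along a regularization path once an estimate of the degrees of freedom is available. The function names, the assumption that the noise variance is known, and the toy fits are all illustrative.

```python
import numpy as np

def information_criteria(y, y_hat, df, sigma2):
    """AIC and BIC as residual sum of squares over sigma^2 plus a df penalty."""
    rss = np.sum((y - y_hat) ** 2)
    n = y.shape[0]
    return rss / sigma2 + 2.0 * df, rss / sigma2 + np.log(n) * df

def select_lambda(y, fits, dfs, lambdas, sigma2):
    """Return the penalty levels minimizing AIC and BIC along a path."""
    scores = np.array([information_criteria(y, f, d, sigma2)
                       for f, d in zip(fits, dfs)])
    return lambdas[np.argmin(scores[:, 0])], lambdas[np.argmin(scores[:, 1])]

# Toy path with two candidate models and n = 100 observations:
# a richer model (RSS = 10, df = 5) and a sparser one (RSS = 18, df = 2).
y = np.zeros(100)
fit_rich = np.zeros(100); fit_rich[0] = np.sqrt(10.0)
fit_sparse = np.zeros(100); fit_sparse[0] = np.sqrt(18.0)
lambdas = np.array([0.1, 1.0])
lam_aic, lam_bic = select_lambda(y, [fit_rich, fit_sparse], [5.0, 2.0], lambdas, sigma2=1.0)
```

In this toy case AIC keeps the richer model while BIC, whose penalty log(n) exceeds 2, prefers the sparser one, which mirrors the stringency difference noted above.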

5. Simulation study. We report here experimental results in the regression
setup, with the linear regression model (1). Our simulation protocol is
inspired by the one proposed by Breiman (1995, 1996) to test the nonnegative
garrote estimator, which inspired the Lasso.

5.1. Data generation. The structure of $\beta^* \in \mathbb{R}^p$ is controlled through
sparsity at coefficient and group levels. Here we have p = 90, forming K = 10
groups of identical size, $p_k = 9$. All groups of parameters follow the same
wave pattern: for $j \in \{1, \ldots, 9\}$, $(\beta^*_{\mathcal{G}_k})_j \propto \zeta_k (h - |5 - j|)_+$, where $\zeta_k \in \{0, 1\}$
is a switch at the group level and $h \in \{3, 4, 5\}$ governs the wave width,
that is, the within-group sparsity, with respectively $|S_k| \in \{5, 7, 9\}$ nonzero
coefficients in each group included in the support. The covariates are drawn
from a multivariate normal distribution $X \sim \mathcal{N}(0, \Psi)$ with, for all
$(j, j') \in \{1, \ldots, p\}^2$, covariances $\Psi_{jj'} = \rho^{|j - j'|}$, where $\rho \in [-1, 1]$. Finally, the response
is corrupted by an error variable $\varepsilon \sim \mathcal{N}(0, 1)$ and the magnitude of the vector
of parameters $\beta^*$ is chosen to have an $R^2$ around 0.75.
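
The generation protocol can be sketched as follows, assuming NumPy. Several details are assumptions of this sketch and not taken from the text: the function names, the choice of active groups, the seed, and the way the magnitude of the parameter vector is set (here by rescaling so that the population signal variance is three times the unit noise variance, which gives a population R² of 0.75).

```python
import numpy as np

def wave_coefficients(active, h, K=10, pk=9):
    """Unscaled wave pattern (h - |5 - j|)_+ for j = 1..pk in each active group."""
    j = np.arange(1, pk + 1)
    wave = np.maximum(h - np.abs(5 - j), 0.0)
    beta = np.zeros(K * pk)
    for k in active:                      # groups whose switch is on
        beta[k * pk:(k + 1) * pk] = wave
    return beta

def draw_covariates(n, p, rho, rng):
    """Gaussian covariates with covariance rho^|j - j'|, unrelated to the groups."""
    idx = np.arange(p)
    psi = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), psi, size=n), psi

rng = np.random.default_rng(0)
beta = wave_coefficients(active=[0, 3, 7], h=4)        # 7 nonzero entries per active group
X, psi = draw_covariates(n=45, p=beta.size, rho=0.4, rng=rng)
beta *= np.sqrt(3.0 / (beta @ psi @ beta))             # signal variance 3, noise variance 1
y = X @ beta + rng.standard_normal(45)
```

With h equal to 3, 4 and 5 the wave has 5, 7 and 9 nonzero coefficients per active group, which matches the three within-group sparsity levels described above.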

Note that the covariance of the covariates is purposely disconnected from 
the group structure. This setting may either be considered as unfair to the 
group methods, or equally adverse for all Lasso-type estimators, in the sense 
that none of their support recovery conditions are fulfilled when $\rho \neq 0$.
Situations more or less advantageous for group methods are then produced thanks
to the parameter h, which determines how the support of /3* matches the 
group structure. 

5.2. Results. Model selection is performed with BIC (21) for Lasso, group-Lasso
and coop-Lasso. The estimate of the degrees of freedom for the Lasso
is the number of nonzero entries in $\hat{\beta}^{\mathrm{lasso}}(\lambda)$ [Zou, Hastie and Tibshirani
(2007)]. As there is no such analytical estimate of the degrees of freedom for
the sparse group-Lasso, we tested two alternative model selection strategies:
standard five-fold cross-validation (CV), selecting the model with minimum
cross-validation error, and the so-called "1-SE rule" [Breiman et al. (1984)], 
which selects the most constrained model whose cross-validation error is 
within one standard error of the minimum. 
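
The two cross-validation strategies can be sketched as below. The sketch assumes that a larger penalty level corresponds to a more constrained model and that the mean CV error and its standard error have already been computed for each penalty level; the function name and the numbers are hypothetical.

```python
import numpy as np

def one_se_rule(lambdas, cv_mean, cv_se):
    """Return (lambda chosen by the 1-SE rule, lambda minimizing the CV error)."""
    lambdas, cv_mean, cv_se = map(np.asarray, (lambdas, cv_mean, cv_se))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    eligible = cv_mean <= threshold
    # Among models within one standard error of the minimum, keep the most
    # constrained one, taken here to be the one with the largest penalty.
    return lambdas[eligible].max(), lambdas[best]

lambdas = [0.01, 0.1, 1.0, 10.0]
cv_mean = [1.30, 1.00, 1.05, 1.60]
cv_se = [0.10, 0.08, 0.09, 0.12]
lam_1se, lam_min = one_se_rule(lambdas, cv_mean, cv_se)
```

Here the minimum CV error is reached at 0.1, while the 1-SE rule moves to the larger penalty 1.0, whose error 1.05 is within the threshold 1.08.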

First, we display in Figure 4 an example of the regularization paths obtained
for each method for a small training set size ($n = p/2 = 45$) drawn
from the model with three active groups having two zero coefficients each
($|S_k| = 7$, $p_k = 9$) and a moderate positive correlation level ($\rho = 0.4$). As
expected, the nonzero coefficients appear one at a time along the Lasso regularization
path and groupwise for the other methods, which detect the relevant
groups early, with some coefficients kept to zero for the sparse group-Lasso 
and the coop-Lasso. The sparse group-Lasso is qualitatively intermediate 
between the group-Lasso and the coop-Lasso, setting many parameters to 
zero, but keeping a few negative coefficients in the solution. The coefficients 
of the model estimated by BIC or the 1-SE rule are displayed on the right of 
each path. The Lasso estimate includes some nonzero coefficients from irrel- 
evant groups, but is otherwise quite conservative, excluding many nonzero 
parameters from its support. This conservative trend is also observed for 
the group methods, which exclude all irrelevant groups. The three group 
estimates mostly agree on truly important coefficients, and differ in the 
treatment of the spurious negative values that are frequent for group-Lasso, 
rarer for sparse group-Lasso and do not occur for coop-Lasso. 

Table 1 provides a more objective evaluation of the compared methods,
based on the root mean square error (RMSE) and the support recovery
(more precisely, recovery of the sign of true parameters); prediction error
(not shown) is tightly correlated with RMSE in our setup. Regarding the
relative merits of the different methods, we did not observe a crucial role of
the number of active groups and the covariate correlation level $\rho$. We report
results for a true support comprising 3 groups out of 10 and $\rho = 0.4$, with
various within-group sparsity and sample size scenarios.

Fig. 4. Lasso, group, sparse group and coop Lasso estimates for a training set of size
n = 45 drawn from the generation process of Section 5.1, with 3 active waves out of 10,
$|S_k|/p_k = 7/9$ and $\rho = 0.4$. Left: regularization paths, where each line type/color represents
a group of parameters and the plain vertical line marks the model selected by the 1-SE rule
for sparse group-Lasso and BIC otherwise; right: true signal (dotted line) and estimated
parameters for the selected model (filled circles).

Table 1

Average errors, with standard deviations, on 1000 simulations from the setup
described in Section 5.1. Each scenario differs in the number of observations n
and the number of active variables per active group $|S_k|$ ($p_k = 9$). Sparse-cv and
sparse-1-se designate the sparse group-Lasso with $\lambda$ selected by cross-validation
and by the 1-SE rule, respectively

Scenario              Lasso        Group        Sparse-cv    Sparse-1-se  Coop

RMSE (×10^3)
|S_k| = 5  n = 45     87.1 (0.5)   95.0 (0.5)   82.5 (0.5)   88.1 (0.6)   84.2 (0.5)
           n = 180    43.7 (0.2)   49.1 (0.2)   41.7 (0.2)   44.9 (0.2)   43.5 (0.2)
           n = 450    28.8 (0.1)   33.4 (0.1)   27.2 (0.1)   30.9 (0.1)   29.4 (0.1)
|S_k| = 7  n = 45     93.0 (0.5)   85.8 (0.5)   79.7 (0.4)   83.6 (0.5)   76.8 (0.5)
           n = 180    48.4 (0.2)   44.5 (0.2)   42.2 (0.2)   43.7 (0.2)   40.4 (0.2)
           n = 450    31.8 (0.1)   30.3 (0.1)   27.7 (0.1)   30.0 (0.1)   27.6 (0.1)
|S_k| = 9  n = 45     99.2 (0.4)   82.0 (0.5)   81.0 (0.4)   83.2 (0.5)   73.7 (0.5)
           n = 180    52.5 (0.2)   41.9 (0.2)   43.3 (0.2)   43.8 (0.2)   39.0 (0.2)
           n = 450    34.1 (0.1)   28.7 (0.1)   28.8 (0.1)   30.6 (0.1)   27.1 (0.1)

Mean sign error (%)
|S_k| = 5  n = 45     13.8 (0.1)   18.3 (0.2)   36.7 (0.4)   16.9 (0.3)   13.3 (0.2)
           n = 180     8.4 (0.1)   19.3 (0.2)   36.1 (0.4)   10.7 (0.2)   13.0 (0.2)
           n = 450     6.1 (0.1)   16.7 (0.2)   35.5 (0.4)    7.1 (0.2)   10.3 (0.2)
|S_k| = 7  n = 45     18.9 (0.1)   12.9 (0.2)   34.6 (0.4)   16.8 (0.3)   10.1 (0.2)
           n = 180    11.9 (0.1)   12.7 (0.2)   34.5 (0.4)   10.5 (0.2)    9.8 (0.2)
           n = 450     8.8 (0.1)   10.4 (0.2)   34.9 (0.4)    7.1 (0.2)    7.7 (0.2)
|S_k| = 9  n = 45     24.4 (0.1)    8.1 (0.2)   34.2 (0.4)   17.3 (0.3)    7.9 (0.2)
           n = 180    15.3 (0.1)    6.3 (0.2)   33.5 (0.4)   10.0 (0.2)    6.7 (0.2)
           n = 450    11.2 (0.1)    4.3 (0.1)   32.6 (0.4)    6.0 (0.2)    4.5 (0.1)
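
For concreteness, the two error measures reported in Table 1 can be sketched as follows. The exact definition of the sign error is an assumption of this sketch: it is taken to be the percentage of coefficients whose estimated sign (negative, zero or positive) differs from the true one.

```python
import numpy as np

def rmse(beta_hat, beta_true):
    """Root mean square error on the parameters."""
    return np.sqrt(np.mean((beta_hat - beta_true) ** 2))

def sign_error(beta_hat, beta_true):
    """Percentage of coefficients with a wrong sign, zero counting as a sign."""
    return 100.0 * np.mean(np.sign(beta_hat) != np.sign(beta_true))

beta_true = np.array([1.0, 1.0, 0.0, 0.0])
beta_hat = np.array([0.5, -0.1, 0.0, 0.2])
err_rmse = rmse(beta_hat, beta_true)        # sqrt((0.25 + 1.21 + 0 + 0.04) / 4)
err_sign = sign_error(beta_hat, beta_true)  # second and fourth entries are wrong
```

Counting zero as a sign makes the measure sensitive both to false positives and to false negatives, so it summarizes support recovery and sign recovery at once.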

All estimators perform about equally in RMSE, the sparse group-Lasso 
with CV having a slight advantage over the coop-Lasso when many zero 
coefficients belong to the active groups, and the coop-Lasso being marginally 
but significantly better elsewhere. 

Regarding support recovery, model selection with CV leads to models 
overestimating the support of parameters. The 1-SE rule, which slightly 
harms RMSE, is greatly beneficial in this respect. BIC also performs very 
well, incurring a very small loss due to model selection compared to the 
oracle solution picking the model with best support recovery. The Lasso 
dominates all the group methods when many zero coefficients belong to
the active groups. Elsewhere, group methods (with appropriate model selec- 
tion criteria) perform systematically significantly better for the small sample 
sizes. The coop-Lasso ranks first or a close second among group methods in 
all experimental conditions. It thus appears as the method of choice regard- 
ing inference issues when groups conform to the sign-coherence assumption. 






Table 2

Average errors, with standard deviations, on 1000 simulations from the setup of
Table 1 with n = 180, perturbed by switching a proportion $P_\sigma$ of signs in $\beta^*$

          RMSE (×10^3)                               Mean sign error (%)
P_σ       |S_k| = 5    |S_k| = 7    |S_k| = 9        |S_k| = 5    |S_k| = 7    |S_k| = 9
0.1       46.9 (0.2)   45.3 (0.2)   45.8 (0.2)       15.3 (0.2)   12.4 (0.2)    8.8 (0.2)
0.2       49.5 (0.3)   48.9 (0.2)   48.7 (0.2)       17.8 (0.2)   14.3 (0.2)    9.8 (0.2)
0.3       51.0 (0.3)   50.4 (0.3)   50.4 (0.2)       19.3 (0.2)   14.8 (0.2)   10.3 (0.2)
0.4       51.6 (0.2)   51.0 (0.2)   50.2 (0.2)       19.7 (0.2)   14.8 (0.2)    9.8 (0.2)
0.5       52.3 (0.3)   51.3 (0.2)   50.8 (0.2)       20.0 (0.2)   14.6 (0.2)    9.3 (0.2)


5.3. Robustness. The robustness to violations of the sign-coherence assumption
is assessed by switching a proportion $P_\sigma$ of signs in the vector $\beta^*$,
otherwise generated as before. The signs of the corresponding covariates are
switched accordingly, to ensure that only the coop-Lasso estimators are
affected in the process.
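
The perturbation can be sketched as follows. The sketch assumes that the proportion of switched signs is taken among the nonzero coefficients; the function name and the toy data are hypothetical. Flipping a coefficient together with its covariate leaves the regression function unchanged, which is why estimators whose penalty ignores signs across coefficients are not affected.

```python
import numpy as np

def flip_signs(beta, X, proportion, rng):
    """Switch the sign of a proportion of nonzero coefficients and of their covariates."""
    support = np.flatnonzero(beta)
    n_flip = int(round(proportion * support.size))
    flipped = rng.choice(support, size=n_flip, replace=False)
    s = np.ones(beta.size)
    s[flipped] = -1.0
    return beta * s, X * s      # X * s multiplies each column by its sign

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
beta = np.array([1.0, 2.0, 0.0, 3.0])
beta_f, X_f = flip_signs(beta, X, 1 / 3, rng)
```

The product of the flipped design and the flipped coefficients equals the original one, so the response distribution is untouched while the sign pattern within groups is perturbed.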

Table 2 displays the coop-Lasso RMSE, which degrades gradually with the
amount of perturbation, eventually becoming worse than that of the Lasso, except
for full groups. Regarding sign error, for small proportions of sign flips, the
coop-Lasso stays on par with either Lasso or group-Lasso (see Table 1),
but it eventually becomes significantly worse than both of them in most
situations. Thus, if the sign-coherence assumption is not firmly grounded,
either group-Lasso or its sparse version seems to be a better option:
coop-Lasso only remains a second-best choice when there are fewer than 10% of
sign mismatches within groups.

6. Illustrations on real data. This section illustrates the applicability of 
the coop-Lasso on two types of predictors, that is, categorical and continuous 
covariates. The first proposal may be widely applied to ordered categorical 
variables; the second one is specific to microarray data, but should apply 
more generally when groups of variables are produced by clustering. 

In the first application, each group is formed by a set of variables cod- 
ing an ordered categorical variable. Ordinal data are often processed either 
by omitting the order property, treating them as nominal, or by replacing 
each level with a prescribed value, treating them as quantitative. The latter 
procedure, combined with generalized linear regression, leads to monotone 
mapping from levels to responses. Section 6.1 describes how coop-Lasso can 
bias the estimate toward monotone mappings using a categorical treatment 
of ordinal variables. 

In the second application of Section 6.2, the groups are formed by continuous
variables that are redundant noisy measurements (probe signals) pertaining
to a common higher-level unobserved variable (gene activity).
Sign-coherence is expected here, since each measurement should be positively
correlated with the activity of the common unobserved variable. A similar
behavior should also be anticipated when groups of variables are formed by
a clustering preprocessing step based on the Euclidean distance, such as
k-means or average linkage hierarchical clustering [Eisen et al. (1998); Park,
Hastie and Tibshirani (2007); Ma, Song and Huang (2007)]. 

6.1. Monotonicity of responses to ordinal covariates. Monotonicity is
easily dealt with by transforming ordinal covariates into quantitative
variables, but this approach is arbitrary and subject to many criticisms when
there is no well-defined numerical difference between levels, which is often
lacking even for interval data when the lower or the upper interval is not bounded
[Gertheiss and Tutz (2009)]. Hence, the categorical treatment is often
preferred, even if it fails to fully grasp the order relation.

The Lasso, group-Lasso or fused-Lasso have been applied to the categorical
treatment of ordinal features, with the aim to select variables or aggregate
adjacent levels [see Gertheiss and Tutz (2010) and references therein].
The coop-Lasso is used here to make stronger use of the order relationship,
by biasing the mapping from levels to the response variable toward
monotonic solutions. Note that our proposal does not impose monotonicity
and neither does it prescribe an order (although several variations would be 
possible here). In these respects, we depart from the approaches imposing 
hard constraints on regression coefficients [Rufibach (2010)]. 

6.1.1. Methodology. When not treated as numerical, ordinal variables are 
often coded by a set of variables that code differences between levels. Several 
types of codings have been developed in the ANOVA setting, with relatively 
little impact in the regression setting, where the so-called dummy codings are 
intensively used. Indeed, least squares fits are not sensitive to coding choices
provided there is a one-to-one mapping from one to the other, so that codings 
only matter regarding the direct interpretation of regression coefficients. 
However, codings evidently affect the solution in penalized regression, and we 
will use here specific codings to penalize targeted variations. In order to build 
a monotonicity-based penalty, we simply use contrasts that compare two 
adjacent levels. An example of these contrasts is displayed in Table 3, with 
the corresponding codings, known as backward difference codings, which are 
simply obtained by solving a linear system [Serlin and Levin (1985)]. Note 
that several codings are possible for the contrasts given in Table 3. They 
differ in the definition of a global reference level, whose effect is relegated to 
the intercept. As we do not penalize the intercept here, the particular choice
has no effect on the solution.
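
As an illustration of how such codings follow from the contrasts by solving a linear system, here is a small NumPy sketch for a covariate with 4 levels. One assumption is made explicit: the global reference absorbed by the intercept is taken to be the mean over levels, which is the choice that reproduces the quarter-valued codings of Table 3; the function name is hypothetical.

```python
import numpy as np

def backward_difference_coding(n_levels):
    """Contrasts between adjacent levels and the matching coding matrix."""
    H = np.zeros((n_levels, n_levels))
    H[0, :] = 1.0 / n_levels                  # reference: mean over levels (intercept)
    for r in range(1, n_levels):
        H[r, r - 1], H[r, r] = -1.0, 1.0      # contrast: level r minus level r - 1
    contrasts = H[1:].T                       # levels in rows, contrasts in columns
    codings = np.linalg.inv(H)[:, 1:]         # invert, then drop the intercept column
    return contrasts, codings

contrasts, codings = backward_difference_coding(4)
```

Each column of the coding matrix is the covariate to include in the design so that its regression coefficient equals the increment between two adjacent levels.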

Irrespective of the coding, group penalties act as a selection tool for
factors, that is, at variable level [Yuan and Lin (2006)]. On top of this, the
sparse group penalty usually presents the ability to discard a level. With
difference codings, some increments between adjacent levels may be set to
zero, that is, levels may be fused [Gertheiss and Tutz (2010)]. With the
coop-Lasso penalty, all increments are urged to be sign-coherent, thereby
favoring monotonicity. As a side effect, level fusion may also be obtained.

Table 3

Contrasts and codings for comparing the adjacent
levels of a covariate with 4 levels

Level     Contrasts            Codings
0         -1     0     0       -3/4   -1/2   -1/4
1          1    -1     0        1/4   -1/2   -1/4
2          0     1    -1        1/4    1/2   -1/4
3          0     0     1        1/4    1/2    3/4

6.1.2. Experimental setup. We illustrate the approach on the Statlog 
"German Credit" data set [available at the UCI machine learning reposi- 
tory, Frank and Asuncion (2010)], which gathers information about people 
classified as low or high credit risks. This binary response requires an appro- 
priate model, such as logistic regression. The coop-Lasso fitting algorithm is 
easily adaptable to generalized linear models, following exactly the structure 
provided in Algorithm 1, where the appropriate likelihood function replaces 
the sum of square residuals in Step 1. 

All quantitative variables are used for the analysis, but we focus here on 
the regression coefficients of four variables, encoded as integers or nominal 
in the Statlog project, which seem better interpreted as ordered nominal, 
namely: history, with 4 levels describing the ability to pay back credits 
in the past and now; savings, with 4 levels giving the balance of the sav- 
ing account in currency intervals; employment, with 5 levels reporting the 
duration of the present employment in year intervals; and job, with 4 lev- 
els representing an employment qualification scale. Two other variables, re- 
lated to the checking account status and property, were also encoded as 
nominal, but are not described here in full detail since they do not show
distinct qualitative behaviors between methods. We excluded from the or- 
dinal variables categories merging two subcategories possibly corresponding 
to different ranks, such as "critical account /other credits existing (not at 
this bank)" in history, or "unknown/no savings account" in savings. For 
simplicity, we suppressed the corresponding examples, thus ending with a total
of 330 observations, split into three equal-size learning, validation and
test sets. We estimate the logistic regression coefficients on the learning
set, perform model selection from deviance or misclassification error on the
validation set, and finally keep the test set to estimate prediction
performances.

Fig. 5. Regularization paths for four ordinal covariates (history, savings, job and
employment) for the group, coop, and sparse group-Lasso on the contrast coefficients obtained
from backward difference coding (top left, top right and bottom left, respectively). The
transcription of contrasts to levels is also displayed for coop-Lasso (bottom right). Regression
coefficients are plotted against $\log_{10}(\lambda)$. The vertical
lines mark the model selected by cross-validation on the validation set, for different
criteria: deviance (plain), misclassification rate (dashed), and weighted misclassification error
(dotted).



6.1.3. Results. The performances of the three group methods are identi- 
cal, either evaluated in terms of deviance, classification error rate or weighted 
misclassification (unbalanced misclassification losses are provided with the 
data set). The regression coefficients differ, however, as shown in Figure 5 
displaying the regularization paths for all methods. Recall that we only
represent the ordinal covariates history, savings, employment and job. Each
coefficient represents the increment between two adjacent levels, with positive
and negative values resulting in an increase and decrease, respectively.
Monotonicity with respect to all levels is reached if all the values
corresponding to a factor are nonnegative or nonpositive. We also provide an alternative
view of the coop-Lasso path, with the overall effects corresponding to levels, 
obtained by summing up the increments. 
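
The passage from increments to level effects, and the monotonicity check, can be sketched as follows. The convention that the first level has effect zero is an assumption of this sketch (the reference is absorbed by the intercept), and the function names and toy increments are hypothetical.

```python
import numpy as np

def level_effects(increments):
    """Effects of successive levels, obtained by summing up the increments."""
    return np.concatenate(([0.0], np.cumsum(increments)))

def is_monotone(increments):
    """True when all increments of a factor are nonnegative or all are nonpositive."""
    inc = np.asarray(increments)
    return bool(np.all(inc >= 0) or np.all(inc <= 0))

effects = level_effects([0.3, 0.0, 0.2])     # levels 1 and 2 share the same effect
mono_ok = is_monotone([0.3, 0.0, 0.2])
mono_ko = is_monotone([0.3, -0.1, 0.2])
```

A zero increment shows up as two adjacent levels with identical effects, which is the level fusion discussed in this section.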

Most factors are not obviously amenable to quantitative coding since there
is no natural distance between levels. We nevertheless underline that using
the usual quantitative transformation with equidistant values followed by
linear regression would correspond here to identical increments between
levels. All displayed solutions clearly contradict this linear trend
hypothesis.

Our three solutions differ regarding monotonicity, which is almost never 
observed along the group-Lasso regularization path. The sparse group-Lasso 
paths have long sign-coherent sections, where group-Lasso infers slight wig- 
gles. These sections extend further with the coop-Lasso. However, as the 
coop penalty goes to zero, sign-coherence is no longer preserved, and all 
methods eventually reach the same solution. 

The sparse group and the coop-Lasso set some increments to zero, leading 
to the fusion of adjacent levels that should be welcomed regarding interpre- 
tation. The solutions tend to agree on these fusions on long sections of the 
paths, with some additional fusions of the sparse group-Lasso when slight 
monotonic solutions are provided by the coop-Lasso (see employment, lev- 
els 2 and 3, and savings levels 1 and 2). These fusions are perceived more 
directly on the coop-Lasso path of effects, displayed in the bottom right of 
Figure 5, where the effect of each level is displayed directly. 

6.2. Robust microarray gene selection. Most studies on response to che- 
motherapy have considered breast cancer as a single homogeneous entity. 
However, it is a complex disease whose strong heterogeneity should not be 
overlooked. The data set proposed by Hess et al. (2006) consists of gene
expression profiling of patients treated with chemotherapy prior to surgery,
classified as presenting either a pathologic complete response (pCR) or 
a residual disease (not-pCR). It records the signal of 22,269 probes^ ex- 
amining the human genome, each probe being related to a unique gene. 
Following Jeanmougin, Guedj and Ambroise (2011), we restrict our analysis 
to the basal tumors: for this particular subtype of breast cancer, clinical and 
pathologic features are homogeneous in the data set, whereas the response 
to chemotherapy is balanced, with 15 tumors being labeled pCR and 14 
not-pCR. This setup is thus propitious to the statistical analysis of response 
to chemotherapy from the sole activity of genes. 



"^Actually, the data set reports the average signal in probe sets, which are a collection of 
probes designed to interrogate a given sequence. In this paper the term "probe" designates 
Affymetrix probe sets to avoid confusion with the group structure that will be considered 
at a higher level.

6.2.1. Methodology. The usual processing of microarray data relies on
probe measurements that are related to genes in the final interpretation 
of the statistical analysis. Here we would like to take a different stance, 
by gathering all the measurements associated to gene entities at an early 
stage of the statistical inference process. As a matter of fact, we typically 
observe that some probes related to the very same gene have different behav- 
iors. Requiring a consensus at the gene level supports biological coherence, 
thus exercising caution in an inference process where statistically plausible 
explanations are numerous, due to the noisy probe signals and to the cumbersome
$n \ll p$ setup (here n = 29 and p = 22,269). Since the probes related
to a given gene relate to sequences that are predominantly cooperating, the 
sign-coherence assumed by the coop-Lasso is particularly appropriate to im- 
prove robustness to the measurement noise and to encourage biologically 
plausible solutions. 

Our protocol includes a preselection of probes that facilitates the analysis 
for the nonadaptive penalization methods compared here, and also provides 
an assessment of the benefits of adding seemingly less relevant probes into 
the statistical analysis. We proceed as follows: 

• select a restricted number d of probes from classical differential analysis,
where probes are sorted by increasing p-values;

• determine the genes associated to these d probes, retrieve all the probes
related to these genes, and select the corresponding p probes, p > d,
regardless of their signal;

• fit a model with group penalties where groups are defined by genes. 
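
The first two steps of this protocol can be sketched as below. Everything in the sketch is illustrative: the function name, the toy annotation mapping probes to genes, and the assumption that the differential analysis provides one p-value per probe.

```python
import numpy as np

def expand_probes(p_values, probe_to_gene, d):
    """Select the d best-ranked probes, then all probes of the corresponding genes."""
    top = np.argsort(p_values)[:d]                        # d most differentiated probes
    genes = {probe_to_gene[i] for i in top}               # genes they belong to
    expanded = [i for i, g in enumerate(probe_to_gene) if g in genes]
    groups = {g: [i for i in expanded if probe_to_gene[i] == g] for g in genes}
    return top, expanded, groups

# Toy annotation: 6 probes related to 3 genes.
p_values = np.array([0.001, 0.20, 0.03, 0.50, 0.04, 0.90])
probe_to_gene = ["A", "A", "B", "C", "B", "C"]
top, expanded, groups = expand_probes(p_values, probe_to_gene, d=2)
```

In this toy case the two selected probes belong to genes A and B, so the expanded set also includes probes 1 and 4, whose own signals would not have passed the preselection; the resulting dictionary gives the groups to be used by the group penalties.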

6.2.2. Experimental setup. We select the first d = 200 most differentiated 
probes, as identified by the analysis of Jeanmougin, Guedj and Ambroise 
(2011), on the 22,269 probes for the n = 29 patients with basal tumor. These 
200 probes correspond to 172 genes, themselves associated to p = 381 probes 
on the microarray as a whole, with 1 to 13 probes per gene. We clearly enter 
the high-dimensional setup with $p > 13 \times n$.

All signals are normalized to have a unitary within-class variance. We
then compare the Lasso on the d = 200 most differentiated probes with the
Lasso and group, sparse group and coop Lasso on the p = 381 probes. All fits
are produced with our code (available at
http://stat.genopole.cnrs.fr/logiciels/scoop).

Well-motivated analytical model selection criteria are not available today 
for Lasso-type penalties beyond the regression setup. Here, model selection 
is carried out by 5-fold cross-validation: we evaluate the CV error for each 
method with the same block partition using either the binomial deviance or 
the unweighted classification error. 
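
A sketch of the evaluation loop, assuming NumPy, is given below. It only illustrates the idea of scoring every method on the same block partition with two losses; the function names, the scaling of the deviance (twice the average negative log-likelihood) and the trivial baseline predictor used in place of a penalized logistic regression are assumptions of the sketch.

```python
import numpy as np

def make_folds(n, n_folds, rng):
    """One random block partition, to be reused for every method."""
    return np.array_split(rng.permutation(n), n_folds)

def misclassification(y, prob):
    """Unweighted classification error for predicted probabilities of class 1."""
    return np.mean((prob > 0.5) != (y == 1))

def binomial_deviance(y, prob, eps=1e-12):
    """Twice the average negative Bernoulli log-likelihood."""
    prob = np.clip(prob, eps, 1 - eps)
    return -2.0 * np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))

def cv_error(fit_predict, loss, X, y, folds):
    """Mean held-out loss over the folds, with its standard error."""
    losses = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        pred = fit_predict(X[train], y[train], X[held_out])
        losses.append(loss(y[held_out], pred))
    return np.mean(losses), np.std(losses, ddof=1) / np.sqrt(len(folds))

def prevalence_baseline(X_train, y_train, X_test):
    """Placeholder method: predicts the training frequency of class 1."""
    return np.full(len(X_test), y_train.mean())

rng = np.random.default_rng(0)
y = np.array([1.0] * 15 + [0.0] * 14)       # 15 pCR and 14 not-pCR labels
X = np.zeros((29, 1))                       # covariates unused by the baseline
folds = make_folds(29, 5, rng)
score, se = cv_error(prevalence_baseline, binomial_deviance, X, y, folds)
```

Passing the same `folds` object to each call keeps the comparison between methods paired, so that differences in CV scores are not driven by differences in the partition.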

6.2.3. Results. The 5-fold CV scores, either based on deviance or
misclassification losses, are reported for each estimation method in Table 4,
which also displays the number of selected groups and features for the
models selected by minimizing the CV score.

Table 4

CV scores for misclassification error and binomial deviance on the basal tumor data. The
minimizers of CV for misclassification and deviance are respectively denoted by $\lambda^{\mathrm{ccv}}$
and $\lambda^{\mathrm{dcv}}$; the number of selected groups and features respectively refers to genes and
probes

Probes                         d = 200        p = 381
                               Lasso          Lasso          Group          Sparse         Coop

Model selection rule           CV score × 100 (standard error)
Classification   λ^ccv         10.3 (5.8)     6.9 (4.9)      3.4 (3.5)      3.4 (3.5)      3.4 (3.5)
Deviance         λ^dcv         76.5 (37.6)    67.2 (32.3)    13.7 (8.1)     20.5 (10.0)    13.8 (7.9)

Model selection rule           # selected groups (features)
Classification   λ^ccv         17 (17)        16 (17)        11 (15)        14 (21)        9 (11)
Deviance         λ^dcv         19 (19)        17 (18)        13 (21)        16 (26)        14 (18)

Expanding the set of probes from d to p slightly improves the performance
of the Lasso, and considerable further progress is brought by all group
methods, which misclassify about 1 patient among the 29 and roughly quarter
the deviance scores.^ As expected, fewer genes are selected by group methods;
the difference is larger for the minimizers of the misclassification
score and, among those, for the group-Lasso and coop-Lasso, which comply
more strictly with the group structure. These observations indicate that
the group structure defined by genes provides truly useful guidelines for
inference.

The sparsity numbers differ among the group methods, the coop-Lasso
selecting as many genes as the group-Lasso and fewer probes, and the sparse
group-Lasso retaining slightly more genes and probes. A more detailed
picture is provided in Figure 6, which shows the regression coefficients for
the three group estimators fitted on the whole data set with their
respective $\lambda^{\mathrm{CV}}_{\mathrm{class}}$ values. Among the three methods, a total of
15 groups (i.e., genes) are selected. For readability, we only represent the
10 leading groups of regression coefficients (according to their average
norm). We first observe that the magnitude of the coefficients differs for
each method, the coop-Lasso having the smallest one. In fact, there is a
wide range of $\lambda^{\mathrm{CV}}_{\mathrm{class}}$ values for which the misclassification score
is minimal for the coop-Lasso, which makes it possible to choose a highly
penalized solution without affecting accuracy. Magnitude apart, the group
methods behave qualitatively alike for all unitary groups but one, THNSL2
(♦) being set to zero by the coop-Lasso. The same patterns are observed for
two other groups, RNPS1 (□) and EDEM3 (×), whose regression coefficients
are consistently estimated to be sign-coherent. Then, SULF1 (•), though
estimated sign-coherent by the sparse group-Lasso, is excluded from the
support of the group and coop Lasso. Finally, MSH6 (△), estimated as
sign-incoherent by the two other group methods, is excluded from the support
of the coop-Lasso.

Fig. 6. Logistic regression coefficients attached to each probe for the group, sparse group
and coop Lasso (one panel per method; horizontal axis: probe index). Each marker (color and
symbol) designates the gene associated with the probe: RNPS1 (□), MSH6 (△), PRPS2 (•),
H1FX (○), MFGE8 (▽), SULF1 (•), RNF115 (+), RNF38 (a), THNSL2 (♦) and EDEM3 (×).

^A note of caution regarding performance: score comparisons are fair here, in the
sense that the CV scores are optimized with respect to a single parameter $\lambda$, whose role
is analogous for all methods. Additional simulations (not reported here) show that, for all
group methods, the CV error is stable with respect to the random choice of folds and that
the CV curves are smooth around their minima. However, the minimal CV scores are biased
estimates of out-of-sample scores, and the representativeness of their observed differences
can be questioned.

Overall, the probe enrichment scheme we propose here leads to considerable
improvements in prediction performance. This better statistical explanation
is obtained without impairing interpretability, since sign-coherence
is actually often satisfied by all methods and strictly enforced by the
coop-Lasso. As often in this type of study, several methods provide similar
prediction performances, but the explanation provided by the coop-Lasso is
simpler, both from a statistical and from a biological viewpoint. Note that
the coefficient paths (not shown) diverge early between the group and coop
methods, so that the above-mentioned discrepancies are not simply due to
model selection issues. As a final remark, we observed qualitatively similar
behaviors when the initial number of probes d ranged from 10 to 2000. For
d < 1000, the group methods always performed best, with approximately
identical classification errors, the group-Lasso and coop-Lasso slightly
dominating the sparse group-Lasso in terms of deviance. With larger initial
sets of probes, the enrichment procedure becomes less efficient, and all
methods provide similarly degraded results. The setup displayed here, with
d = 200, leads to the smallest classification error for all methods, and was
chosen for being representative of the most interesting regime.

7. Discussion. The coop-Lasso is a variant of the group-Lasso that was
originally proposed in the context of multi-task learning, for inferring re- 
lated networks with Gaussian Graphical Models [Chiquet, Grandvalet and 
Ambroise (2011)]. Here we develop its analysis in the linear regression setup 
and demonstrate its value for prediction and inference with generalized lin- 
ear models. Along with this paper we provide an implementation of the 
fitting algorithm in the R package scoop, which makes this new penalized 
estimate publicly available for linear and logistic regression (the coop-Lasso 
for multiple network inference is also available in the R package simone). 

The coop-Lasso differs from the group-Lasso and sparse group-Lasso [Friedman,
Hastie and Tibshirani (2010)] by the assumption that the group structure
is sign-coherent, namely, that groups gather either nonpositive, nonnegative
or null parameters, enabling the recovery of various within-group
sign patterns (positive, negative, null, nonpositive, nonnegative, nonnull). 
This flexibility greatly reduces the incentive to drive within-group sparsity 
with an additional parameter that later leads to an unwieldy model selec- 
tion step. However, the relevance of the sign-coherence assumption should 
be firmly established since it plays an essential role in the performance of 
coop-Lasso compared to the sparse group-Lasso. 
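
To make the role of sign-coherence concrete, here is a minimal sketch comparing the group-Lasso penalty with the coop-Lasso penalty, taken as the weighted sum over groups of the norms of the positive and negative parts. The function names and the toy vectors are illustrative assumptions, not part of the scoop package.

```python
import numpy as np

def group_norm(v, groups, weights):
    """Group-Lasso penalty: sum_k w_k * ||v_{G_k}||."""
    return sum(w * np.linalg.norm(v[g]) for g, w in zip(groups, weights))

def coop_norm(v, groups, weights):
    """Coop-Lasso penalty: sum_k w_k * (||v_{G_k}^+|| + ||v_{G_k}^-||)."""
    total = 0.0
    for g, w in zip(groups, weights):
        pos = np.maximum(v[g], 0.0)   # positive parts of the group
        neg = np.maximum(-v[g], 0.0)  # negative parts of the group
        total += w * (np.linalg.norm(pos) + np.linalg.norm(neg))
    return total

groups = [np.array([0, 1]), np.array([2, 3])]
weights = [1.0, 1.0]
coherent = np.array([3.0, 4.0, 0.0, 0.0])  # first group is sign-coherent
mixed = np.array([3.0, -4.0, 0.0, 0.0])    # first group mixes signs

penalties = {
    "coherent": (group_norm(coherent, groups, weights),
                 coop_norm(coherent, groups, weights)),
    "mixed": (group_norm(mixed, groups, weights),
              coop_norm(mixed, groups, weights)),
}
```

On a sign-coherent group the two penalties coincide, whereas a group mixing signs is charged more by the coop-Lasso, which is how the penalty favors sign-coherent solutions.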

Under suitable irrepresentable conditions, the proposed penalty leads to 
consistent model selection, even when the true sparsity pattern does not 
match the group structure. When the groups are sign-coherent, the
coop-Lasso compares favorably to the group-Lasso, recovering the true
support under milder assumptions.

We present an approximation of the effective degrees of freedom of the 
coop-Lasso which, once plugged into AIC or BIC, provides a fast way to 
select the tuning parameter in the linear regression setup. We provide em- 
pirical results demonstrating the capabilities of the coop-Lasso in terms of 
prediction and parameter selection, with BIC performing very well regarding 
support recovery even for small sample sizes. 

We illustrate the merits of the coop-Lasso applied to the analysis of
ordinal and continuous predictors. With an apposite coding, such as forward
or backward difference coding, the sign-coherence assumption translates
into a monotonicity assumption, which does not require stipulating the usual
and controversial mapping from levels to quantitative variables. Finally, the
application to genomic data opens a vast field of great practical interest
for this type of penalty, both in terms of prediction and interpretability.
Our forthcoming investigations will aim at substantiating this ambition
by conducting large-scale experiments in this application domain.
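
The link between sign-coherence and monotonicity for ordinal predictors can be illustrated with a small sketch. The cumulative indicator coding used below is one coding with the required property and is an assumption made for illustration; it is not necessarily the exact contrast matrix used in the experiments.

```python
import numpy as np

def cumulative_coding(levels, n_levels):
    """Code ordinal levels 0..n_levels-1 into n_levels-1 indicator columns.

    Column l equals 1 when the observed level exceeds l, so that the
    coefficient of column l is the increment of the effect between
    level l and level l+1.
    """
    levels = np.asarray(levels)
    thresholds = np.arange(n_levels - 1)
    return (levels[:, None] > thresholds[None, :]).astype(float)

levels = np.array([0, 1, 2, 3])
Z = cumulative_coding(levels, n_levels=4)

# Sign-coherent (here nonnegative) increments, one group for the variable.
increments = np.array([0.5, 0.0, 1.2])
effects = Z @ increments  # effect attached to each level
```

With this coding, a group of nonnegative (respectively nonpositive) coefficients yields a nondecreasing (respectively nonincreasing) effect across levels, without assigning numerical values to the levels themselves.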

APPENDIX: PROOFS 

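A.1. Proof of Lemma 1. Let us use $\mathcal{S}_k$ as a shorthand for $\mathcal{S}_k(\beta)$. Chiquet,
Grandvalet and Ambroise (2011) show that the subgradients $\theta$ obey the
following conditions:

(26a)  $\max(\|\theta^+_{\mathcal{G}_k}\|, \|\theta^-_{\mathcal{G}_k}\|) \le w_k$   if $\beta_{\mathcal{G}_k} = 0$,

(26b)  $\theta_{\mathcal{S}_k} = w_k \beta_{\mathcal{S}_k} / \|\beta^+_{\mathcal{G}_k}\|$,  $\|\theta^-_{\mathcal{S}_k^c}\| \le w_k$,  $\|\theta^+_{\mathcal{S}_k^c}\| = 0$   if $\|\beta^+_{\mathcal{G}_k}\| > 0$, $\|\beta^-_{\mathcal{G}_k}\| = 0$,

(26c)  $\theta_{\mathcal{S}_k} = w_k \beta_{\mathcal{S}_k} / \|\beta^-_{\mathcal{G}_k}\|$,  $\|\theta^+_{\mathcal{S}_k^c}\| \le w_k$,  $\|\theta^-_{\mathcal{S}_k^c}\| = 0$   if $\|\beta^-_{\mathcal{G}_k}\| > 0$, $\|\beta^+_{\mathcal{G}_k}\| = 0$,

(26d)  $\forall j \in \mathcal{G}_k$, $\theta_j = w_k \beta_j \|(\mathrm{sign}(\beta_j)\beta_{\mathcal{G}_k})^+\|^{-1}$   if $\|\beta^-_{\mathcal{G}_k}\| > 0$, $\|\beta^+_{\mathcal{G}_k}\| > 0$.

We thus simply have to prove the equivalence of conditions (8) and (26) for
all $\beta_{\mathcal{G}_k}$ values.

For $\beta_{\mathcal{G}_k} = 0$, (8) reads

(27)  $\|\theta^+_{\mathcal{G}_k}\| \le w_k$ and $\|\theta^-_{\mathcal{G}_k}\| \le w_k$,

which is equivalent to (26a).

For $\beta_{\mathcal{G}_k} \ne 0$, the equalities for $\theta_{\mathcal{S}_k}$ in (26b)-(26d) are equivalent to (8a),
thus establishing the equivalence between (8) and (26) for all nonzero
coefficients. For the zero coefficients $\beta_{\mathcal{S}_k^c}$, let us consider the case (26b),
where all nonzero parameters within group k are positive. The first equation
of (26b) implies that $\|\theta^+_{\mathcal{S}_k}\| = w_k$ and $\|\theta^-_{\mathcal{S}_k}\| = 0$. Hence,
$\|\theta^-_{\mathcal{S}_k^c}\| \le w_k$ and $\|\theta^+_{\mathcal{S}_k^c}\| = 0$ imply (27), so that (26b) implies (8). The
converse is also easy to check. From (8a), when all nonzero coefficients are
positive, we have that $\|\theta^-_{\mathcal{S}_k}\| = 0$ and $\|\theta^+_{\mathcal{S}_k}\| = w_k$. Then, this
implies that (8b) reads

$\|\theta^-_{\mathcal{S}_k^c}\| \le w_k$ and $\|\theta^+_{\mathcal{S}_k^c}\| = 0$,

which defines $\theta_{\mathcal{S}_k^c}$ in (26b). The proof is similar for (26c), where all
nonzero parameters within group k are negative.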

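A.2. Proof of Proposition 1. We assume here that $X^T X = I_p$. We introduce
the ridge estimator in the computation of the trace in equation (22),
through the chain rule, yielding an unbiased estimate of df:

$$\mathrm{df}_{\mathrm{coop}}(\lambda) = \mathrm{tr}\left(\frac{\partial \hat{y}(\lambda)}{\partial y}\right) = \mathrm{tr}\left(\frac{\partial X\hat\beta^{\mathrm{coop}}(\lambda)}{\partial \hat\beta^{\mathrm{ridge}}(\gamma)} \frac{\partial \hat\beta^{\mathrm{ridge}}(\gamma)}{\partial y}\right) = \frac{1}{1+\gamma}\, \mathrm{tr}\left(\frac{\partial \hat\beta^{\mathrm{coop}}(\lambda)}{\partial \hat\beta^{\mathrm{ridge}}(\gamma)}\right),$$

where the last equation derives from the definition (24) of the ridge
estimator with regularization parameter $\gamma$. Then, the expression of the
coop-Lasso as a function of the ridge regression estimate is simply obtained
from equation (10), using that, in the orthonormal case, we have
$\hat\beta^{\mathrm{ols}} = (1+\gamma)\hat\beta^{\mathrm{ridge}}(\gamma)$. Dropping the reference to $\lambda$ and $\gamma$ that is
obvious from the context, we have, $\forall k \in \{1, \ldots, K\}$ and $\forall j \in \mathcal{G}_k$,

(28)  $$\hat\beta_j^{\mathrm{coop}} = \left(1 - \frac{\lambda w_k}{(1+\gamma)\|\varphi_j(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})\|}\right)^+ (1+\gamma)\hat\beta_j^{\mathrm{ridge}}.$$

Then, for $j \in \mathcal{G}_k$, routine differentiation gives

$$\frac{1}{1+\gamma} \frac{\partial \hat\beta_j^{\mathrm{coop}}}{\partial \hat\beta_j^{\mathrm{ridge}}} = \mathbb{1}(\|\varphi_j(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})\| > 0) \left[1 - \frac{\lambda w_k}{(1+\gamma)\|\varphi_j(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})\|} \left(1 - \frac{(\hat\beta_j^{\mathrm{ridge}})^2}{\|\varphi_j(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})\|^2}\right)\right].$$

The summation over the positive and negative elements of $\mathcal{G}_k$ reduces to two
terms

$$\frac{1}{1+\gamma} \sum_{j \in \mathcal{G}_k} \frac{\partial \hat\beta_j^{\mathrm{coop}}}{\partial \hat\beta_j^{\mathrm{ridge}}}
= \mathbb{1}(\|(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})^+\| > 0) \left(p_+^k - \frac{\lambda w_k}{(1+\gamma)\|(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})^+\|}(p_+^k - 1)\right)
+ \mathbb{1}(\|(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})^-\| > 0) \left(p_-^k - \frac{\lambda w_k}{(1+\gamma)\|(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})^-\|}(p_-^k - 1)\right)$$

$$= \mathbb{1}(\|(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})^+\| > 0) \left[1 + \left(1 - \frac{\lambda w_k}{(1+\gamma)\|(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})^+\|}\right)(p_+^k - 1)\right]
+ \mathbb{1}(\|(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})^-\| > 0) \left[1 + \left(1 - \frac{\lambda w_k}{(1+\gamma)\|(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})^-\|}\right)(p_-^k - 1)\right].$$

From (28), we have, $\forall k \in \{1, \ldots, K\}$ and $\forall j \in \mathcal{G}_k$,

$$\left(1 - \frac{\lambda w_k}{(1+\gamma)\|\varphi_j(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})\|}\right)^+ = \frac{1}{1+\gamma} \frac{\|\varphi_j(\hat\beta^{\mathrm{coop}}_{\mathcal{G}_k})\|}{\|\varphi_j(\hat\beta^{\mathrm{ridge}}_{\mathcal{G}_k})\|},$$

which is used twice to simplify the previous expression. Summing over all
groups concludes the proof.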
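
The computation above can be checked numerically in the orthonormal case. The sketch below is an illustration under assumptions: it is written in terms of the ordinary least squares coefficients, using $\hat\beta^{\mathrm{ols}} = (1+\gamma)\hat\beta^{\mathrm{ridge}}(\gamma)$ so that $\gamma$ cancels out, and the function names and toy values are hypothetical. It compares the closed-form expression with a finite-difference evaluation of the trace of the Jacobian.

```python
import numpy as np

def coop_orthonormal(beta_ols, groups, weights, lam):
    """Coop-Lasso estimate when X^T X = I: within each group, the positive
    and negative parts are shrunk separately by (1 - lam * w_k / norm)^+."""
    beta = np.zeros_like(beta_ols)
    for g, w in zip(groups, weights):
        b = beta_ols[g]
        out = np.zeros_like(b)
        for part in (b > 0, b < 0):
            norm = np.linalg.norm(b[part])
            if norm > 0:
                out[part] = max(1.0 - lam * w / norm, 0.0) * b[part]
        beta[g] = out
    return beta

def df_coop(beta_ols, groups, weights, lam):
    """Closed-form degrees of freedom, summed over groups and signs."""
    beta = coop_orthonormal(beta_ols, groups, weights, lam)
    df = 0.0
    for g in groups:
        b, c = beta_ols[g], beta[g]
        for part in (b > 0, b < 0):
            norm_c = np.linalg.norm(c[part])
            if norm_c > 0:
                # 1 + (number of same-sign entries - 1) * shrinkage ratio
                df += 1.0 + (part.sum() - 1) * norm_c / np.linalg.norm(b[part])
    return df

def df_numeric(beta_ols, groups, weights, lam, h=1e-6):
    """Trace of the Jacobian of beta_coop w.r.t. beta_ols (central differences)."""
    trace = 0.0
    for j in range(len(beta_ols)):
        e = np.zeros_like(beta_ols)
        e[j] = h
        up = coop_orthonormal(beta_ols + e, groups, weights, lam)[j]
        down = coop_orthonormal(beta_ols - e, groups, weights, lam)[j]
        trace += (up - down) / (2 * h)
    return trace

beta_ols = np.array([2.0, 1.0, -0.3, 0.2, -0.1, 1.5])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
weights = [1.0, 1.0]
lam = 0.5
df_closed = df_coop(beta_ols, groups, weights, lam)
df_fd = df_numeric(beta_ols, groups, weights, lam)
```

In this toy example the negative parts of both groups are thresholded to zero, so only the two positive parts contribute to the degrees of freedom.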

A.3. Proof of Theorem 2. Our asymptotic results are established on the
scaled problem (18). We then follow the three-step proof technique proposed
by Yuan and Lin (2007) for the Lasso and also applied by Bach (2008) for
the group-Lasso:

(1) restrict the estimation problem to the true support;

(2) complete this estimate by 0 outside the true support;

(3) prove that this artificial estimate satisfies optimality conditions for
the original coop-Lasso problem with probability tending to 1.

Then, under (A2), the solution is unique, leading to the conclusion that
the coop-Lasso estimator is equal to this artificial estimate with probability
tending to 1, which ends the proof. Note, however, a slight yet important
difference along the discussion: since we allow mismatches between the
group structure $\{\mathcal{G}_k\}_{k=1}^K$ and the support $\mathcal{S}$, the irrepresentable
conditions (A4)-(A5) for the coop-Lasso cannot be expressed simply in terms of
coop-norms [as it is done with the group-norm in Bach (2008)]. We will see
that this does not impede the development of the proof.

As a first step, we prove two simple lemmas. Lemma 2 states that the
coop-Lasso estimate, restricted to the true support $\mathcal{S}$, is consistent when
$\lambda_n \to 0$. Lemma 3 provides the basis for the inequalities (16) and (17) that
express our irrepresentable conditions.

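Lemma 2. Assuming (A1)-(A3), let $\tilde\beta_{\mathcal{S}}$ be the unique minimizer of the
regression problem restricted to the true support $\mathcal{S}$:

$$\tilde\beta_{\mathcal{S}} = \arg\min_{v \in \mathbb{R}^{|\mathcal{S}|}} \frac{1}{2}\|y - X_{\cdot\mathcal{S}} v\|_n^2 + \lambda_n \sum_{k=1}^K w_k \left(\|v^+_{\mathcal{S}_k}\| + \|v^-_{\mathcal{S}_k}\|\right),$$

where $\|\cdot\|_n = \|\cdot\|/n^{1/2}$ denotes the empirical norm.
If $\lambda_n \to 0$, then $\tilde\beta_{\mathcal{S}} \xrightarrow{P} \beta^\star_{\mathcal{S}}$.

Proof. This lemma stems from standard results of M-estimation [van der
Vaart (1998)]. Let $\varepsilon = y - X\beta^\star$, and write $\Psi^n = X^T X/n$. If $\lambda_n \to 0$, then
under (A1)-(A2), for any $v \in \mathbb{R}^{|\mathcal{S}|}$,

$$Z_n(v) = \frac{1}{2}\|y - X_{\cdot\mathcal{S}} v\|_n^2 + \lambda_n \sum_{k=1}^K w_k \left(\|v^+_{\mathcal{S}_k}\| + \|v^-_{\mathcal{S}_k}\|\right)$$

$$= \frac{1}{2}(\beta^\star_{\mathcal{S}} - v)^T \Psi^n_{\mathcal{S}\mathcal{S}} (\beta^\star_{\mathcal{S}} - v) + \frac{1}{n}\varepsilon^T X_{\cdot\mathcal{S}}(\beta^\star_{\mathcal{S}} - v) + \frac{\varepsilon^T\varepsilon}{2n} + \lambda_n \sum_{k=1}^K w_k \left(\|v^+_{\mathcal{S}_k}\| + \|v^-_{\mathcal{S}_k}\|\right)$$

tends in probability to

$$Z(v) = \frac{1}{2}(\beta^\star_{\mathcal{S}} - v)^T \Psi_{\mathcal{S}\mathcal{S}} (\beta^\star_{\mathcal{S}} - v) + \frac{\sigma^2}{2},$$

where $\sigma^2$ denotes the variance of the noise. It follows from the strict
convexity of $Z_n$ that $\arg\min Z_n(v) \xrightarrow{P} \arg\min Z(v) = \beta^\star_{\mathcal{S}}$
[Knight and Fu (2000)], which ends the proof. □

Lemma 3. Consider a sequence of random variables $S_n$ such that
$S_n \xrightarrow{P} S$. Suppose there exists $\delta > 0$ such that, for a given norm $\mu$, the
limit $S$ is bounded away from 1:

$$\mu(S) \le 1 - \delta.$$

Then,

$$P(\mu(S_n) \le 1) \to 1.$$

Proof. By the triangle inequality and thanks to the constraint on $\mu(S)$,

$$P(\mu(S_n) \le 1) \ge P(\mu(S_n - S) \le 1 - \mu(S)) \ge P(\mu(S_n - S) \le \delta).$$

Convergence in probability of $S_n$ to $S$ concludes the proof:

$$P(\mu(S_n - S) \le \delta) \to 1, \quad \text{therefore} \quad P(\mu(S_n) \le 1) \to 1. \quad \square$$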
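
The chain of inequalities used in the proof of Lemma 3 can be spelled out as inclusions of events. This is only an expanded restatement of the triangle-inequality step, added for convenience:

```latex
\begin{align*}
\mu(S_n) \le \mu(S_n - S) + \mu(S)
  &\;\Longrightarrow\;
  \{\mu(S_n - S) \le 1 - \mu(S)\} \subseteq \{\mu(S_n) \le 1\}, \\
\mu(S) \le 1 - \delta
  &\;\Longrightarrow\;
  \{\mu(S_n - S) \le \delta\} \subseteq \{\mu(S_n - S) \le 1 - \mu(S)\}.
\end{align*}
```

Taking probabilities on both sides of each inclusion gives the two inequalities of the proof.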

Let us consider the full vector $\tilde\beta$ with coefficients $\tilde\beta_{\mathcal{S}}$ defined as in
Lemma 2 and other coefficients null, $\tilde\beta_{\mathcal{S}^c} = 0$. We now proceed to the last
step of the proof of Theorem 2, by proving that $\tilde\beta$ satisfies the coop-Lasso
optimality conditions with probability tending to 1 under the additional
conditions (A4)-(A5). The final conclusion then results from the uniqueness
of the coop-Lasso estimator.

First, consider optimality conditions with respect to $\tilde\beta_{\mathcal{S}}$. As a result of
Lemma 2, the probability that $\tilde\beta_j \ne 0$ for every $j \in \mathcal{S}$ tends to 1. Thereby,
$\tilde\beta_{\mathcal{S}}$ satisfies (9a) on the restriction of $X$ to covariates in $\mathcal{S}$ with
probability tending to 1. As $\tilde\beta_{\mathcal{S}^c} = 0$, we have $X\tilde\beta = X_{\cdot\mathcal{S}}\tilde\beta_{\mathcal{S}}$ and, for every
$j \in \mathcal{S}$, $\|\varphi_j(\tilde\beta_{\mathcal{S}_k})\| = \|\varphi_j(\tilde\beta_{\mathcal{G}_k})\|$; therefore, $\tilde\beta_{\mathcal{S}}$ satisfies (9a) in the
original problem with probability tending to 1.

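Second, $\tilde\beta_{\mathcal{S}^c}$ should also verify the optimality conditions (9) with
probability tending to 1. With assumption (A3), we only have to consider two
cases that read:

• if group k is excluded from the support, one must have

(29)  $$P\left(\max\left(\tfrac{1}{n}\|((X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y))^+\|,\; \tfrac{1}{n}\|((X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y))^-\|\right) \le \lambda_n w_k\right) \to 1;$$

• if group k intersects the support, with either positive ($\nu_k = 1$) or negative
($\nu_k = -1$) coefficients, one must have

(30)  $$P\left(\{\nu_k (X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y) \succeq 0\} \cap \{\tfrac{1}{n}\|(X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y)\| \le \lambda_n w_k\}\right) \to 1.$$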

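To prove (29) and (30), we study the asymptotics of $(X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y)/n$
for any group such that $\mathcal{S}_k^c$ is not empty. As a consequence of the existence
of the fourth order moments of the centered random variables X and Y, the
multivariate central limit theorem applies, yielding

$$\frac{1}{n} X^T X = \Psi + O_P(n^{-1/2}) \quad \text{and} \quad \frac{1}{n} X^T \varepsilon = O_P(n^{-1/2}).$$

Then, we derive from these limits and the definition of $\tilde\beta$ that

$$\frac{1}{n}(X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y) = \frac{1}{n}(X_{\cdot\mathcal{S}_k^c})^T X(\tilde\beta - \beta^\star) - \frac{1}{n}(X_{\cdot\mathcal{S}_k^c})^T \varepsilon$$

(31)  $$= \frac{1}{n}(X_{\cdot\mathcal{S}_k^c})^T X_{\cdot\mathcal{S}}(\tilde\beta_{\mathcal{S}} - \beta^\star_{\mathcal{S}}) + O_P(n^{-1/2})$$

$$= \Psi^n_{\mathcal{S}_k^c\mathcal{S}}(\tilde\beta_{\mathcal{S}} - \beta^\star_{\mathcal{S}}) + O_P(n^{-1/2}),$$

while the combination of (31) and optimality conditions (9a) on $\tilde\beta_{\mathcal{S}}$ leads to

(32)  $$\Psi^n_{\mathcal{S}\mathcal{S}}(\tilde\beta_{\mathcal{S}} - \beta^\star_{\mathcal{S}}) = -\lambda_n D(\tilde\beta_{\mathcal{S}})\tilde\beta_{\mathcal{S}} + O_P(n^{-1/2}),$$

where $D(\cdot)$ is the weighting matrix (15). Put (31) and (32) together to finally
obtain

(33)  $$\frac{1}{n}(X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y) = -\lambda_n \Psi^n_{\mathcal{S}_k^c\mathcal{S}} (\Psi^n_{\mathcal{S}\mathcal{S}})^{-1} D(\tilde\beta_{\mathcal{S}})\tilde\beta_{\mathcal{S}} + O_P(n^{-1/2}).$$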
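Now, define for any k such that $\mathcal{S}_k^c$ is not empty:

$$R_{k,n} = \frac{1}{\lambda_n w_k n}(X_{\cdot\mathcal{S}_k^c})^T(X\tilde\beta - y) \quad \text{and} \quad R_k = -\frac{1}{w_k}\Psi_{\mathcal{S}_k^c\mathcal{S}} \Psi_{\mathcal{S}\mathcal{S}}^{-1} D(\beta^\star_{\mathcal{S}})\beta^\star_{\mathcal{S}}.$$

Limits (29) and (30) are expressed:

• if group k is excluded from the support, one must have

$$P(\max(\|R^+_{k,n}\|, \|R^-_{k,n}\|) \le 1) \to 1;$$

• if group k intersects the support, with either positive ($\nu_k = 1$) or negative
($\nu_k = -1$) coefficients, one must have

$$P(\{\nu_k R_{k,n} \succeq 0\} \cap \{\|(\nu_k R_{k,n})^+\| \le 1\}) \to 1.$$

Remark that, as a continuous function of $\tilde\beta_{\mathcal{S}}$, $D(\tilde\beta_{\mathcal{S}})\tilde\beta_{\mathcal{S}}$ converges in
probability to $D(\beta^\star_{\mathcal{S}})\beta^\star_{\mathcal{S}}$. Therefore, with a decrease rate for $\lambda_n$ chosen such
that $n^{1/2}\lambda_n \to \infty$, equation (33) implies

(34)  $$R_{k,n} \xrightarrow{P} R_k.$$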

It now suffices to successively apply Lemma 3 to the appropriate vectors
and norms to show that $\tilde\beta_{\mathcal{S}^c}$ satisfies (29) and (30):

• if group k is excluded from the support, (A4) assumes that there exists
$\eta > 0$, such that

$$\max(\|R_k^+\|, \|R_k^-\|) \le 1 - \eta;$$

• if group k intersects the support, with either positive ($\nu_k = 1$) or negative
($\nu_k = -1$) coefficients,

$$1 - P(\{\nu_k R_{k,n} \succeq 0\} \cap \{\|(\nu_k R_{k,n})^+\| \le 1\}) \le P(\|(\nu_k R_{k,n})^+\| > 1) + P(\nu_k R_{k,n} \nsucceq 0).$$

As previously, the first probability in the sum tends to 0 because of (A4)
and Lemma 3. The second probability tends to 0 from (A5) and the
convergence in probability of $R_{k,n}$ to $R_k$. Therefore, the overall probability
tends to 1.
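
As an illustration of the quantity entering condition (A4) for a group excluded from the support, the following sketch evaluates $\max(\|R_k^+\|, \|R_k^-\|)$ on a toy covariance matrix. Several elements are assumptions made for this example only: the weighting $D(\beta_{\mathcal{S}})\beta_{\mathcal{S}}$ is taken to have entries $w_k\beta_j/\|\varphi_j(\beta_{\mathcal{G}_k})\|$, as suggested by (8a), and the covariance matrix, support and function names are hypothetical.

```python
import numpy as np

def coop_weighting(beta_S, groups_S, weights):
    """D(beta_S) beta_S, assuming entries w_k * beta_j / ||same-sign part||.

    `groups_S` lists, for each group, the positions of its supported
    coefficients inside beta_S (all assumed nonzero).
    """
    theta = np.zeros_like(beta_S)
    for g, w in zip(groups_S, weights):
        b = beta_S[g]
        for part in (b > 0, b < 0):
            if part.any():
                theta[g[part]] = w * b[part] / np.linalg.norm(b[part])
    return theta

def excluded_group_bound(Psi, S, Sc_k, w_k, theta_S):
    """max(||R_k^+||, ||R_k^-||) with R_k = -(1/w_k) Psi_{Sc_k,S} Psi_{S,S}^{-1} theta_S."""
    solved = np.linalg.solve(Psi[np.ix_(S, S)], theta_S)
    R_k = -(Psi[np.ix_(Sc_k, S)] @ solved) / w_k
    pos = np.linalg.norm(np.maximum(R_k, 0.0))
    neg = np.linalg.norm(np.maximum(-R_k, 0.0))
    return max(pos, neg)

# Toy setting: group {0, 1} is the support, group {2, 3} is excluded,
# and variable 2 is correlated with variable 0.
Psi = np.array([[1.0, 0.0, 0.5, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.5, 0.0, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
S = np.array([0, 1])
Sc_k = np.array([2, 3])
beta_S = np.array([3.0, 4.0])
theta_S = coop_weighting(beta_S, [np.array([0, 1])], [1.0])
bound = excluded_group_bound(Psi, S, Sc_k, 1.0, theta_S)
```

In this toy setting the bound stays well below 1, so the excluded group would satisfy the inequality with some margin $\eta$; stronger correlations between the excluded group and the support would push it toward 1.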

Denote by $A_{k,n}$ these events on which coefficients in $\mathcal{S}_k^c$ are set to 0. We
just showed that individually for each group k with true null coefficients,
$P(A_{k,n}) \to 1$. This implies that

$$P\Big(\bigcap_k A_{k,n}\Big) \ge 1 - \sum_k P(A_{k,n}^c) \to 1,$$

which in turn concludes the proof.